1 point by patientfrog 3135 days ago | link | parent

I don't know many details of what you are dealing with, so take my response with a big grain of salt. Still, from what it sounds like, you are in a good place: you may make some mistakes, but it will be a great learning experience.

There are many ways you might approach it. Netflix's or Google's use cases might be very different from your company's (scale issues aside). The first thing to do is talk with the team about what the needs are now and what they are likely to be in the future.

That said, a good general thing to do with your data, no matter what it looks like, is to put it somewhere you would call an archive. My suggestion is S3, since it is cheap (likely cheaper than using an EC2 instance as a storage device), easy to use (i.e. low maintenance), and plenty of analytics tools can read from S3 out of the box, such as Spark, including when run on EMR. When things get too big or expensive, it might be time to move things into HDFS or a more mature file storage system.
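To make the archive idea a bit more concrete, here is a rough sketch of one way to lay out keys in S3 so that later jobs can pick up a single day or dataset. Everything specific here is my assumption, not something from your setup: the bucket name, the `raw/` prefix, the `year=/month=/day=` layout, and the helper names `archive_key` and `archive_file` are all hypothetical. The only library call it relies on is boto3's `upload_file(Filename, Bucket, Key)` on an S3 client.

```python
import os
from datetime import date


def archive_key(dataset: str, day: date, filename: str, prefix: str = "raw") -> str:
    """Build a date-partitioned object key (hypothetical layout)."""
    return (
        f"{prefix}/{dataset}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/{filename}"
    )


def archive_file(s3_client, bucket: str, local_path: str, dataset: str, day: date) -> str:
    """Upload one local file to the archive bucket and return its key.

    s3_client is expected to behave like boto3.client("s3"); it is passed in
    so this function can be exercised without AWS credentials.
    """
    key = archive_key(dataset, day, os.path.basename(local_path))
    s3_client.upload_file(local_path, bucket, key)
    return key


# Real use would look roughly like this (needs the boto3 package and
# AWS credentials; "my-company-archive" is a made-up bucket name):
#
#   import boto3
#   s3 = boto3.client("s3")
#   archive_file(s3, "my-company-archive", "/tmp/events.json.gz",
#                "events", date.today())
```

Partitioning keys by dataset and date is only one convention, but it keeps the archive append-only and lets tools that read from S3 select a subset of the data by prefix.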

As you get a better sense of which types of queries and ETL jobs are common, you might consider a database to store pre-processed data. AWS has lots of options to help you get started. They are expensive, but they save time if you need to solve problems fast.
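As a small illustration of what "pre-processed data in a database" can mean, here is a minimal sketch that rolls raw events up into a daily summary table. I am using SQLite from the Python standard library purely as a stand-in for whatever database you end up choosing; the table name `daily_event_counts`, the event shape `(day, event_type)`, and both function names are assumptions made up for the example.

```python
import sqlite3
from collections import Counter


def load_daily_counts(conn, events):
    """Aggregate (iso_day, event_type) tuples and store the counts.

    INSERT OR REPLACE keyed on (day, event_type) means re-running the job
    for the same day overwrites that day's rows instead of duplicating them.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_event_counts ("
        " day TEXT NOT NULL,"
        " event_type TEXT NOT NULL,"
        " n INTEGER NOT NULL,"
        " PRIMARY KEY (day, event_type))"
    )
    counts = Counter(events)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO daily_event_counts (day, event_type, n)"
            " VALUES (?, ?, ?)",
            [(day, etype, n) for (day, etype), n in counts.items()],
        )


def counts_for_day(conn, day):
    """Return {event_type: count} for one day from the summary table."""
    rows = conn.execute(
        "SELECT event_type, n FROM daily_event_counts WHERE day = ?",
        (day,),
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    raw = [("2017-03-05", "click"), ("2017-03-05", "click"), ("2017-03-05", "view")]
    load_daily_counts(conn, raw)
    print(counts_for_day(conn, "2017-03-05"))
```

The point is the pattern rather than the database: analysts query the small summary table, while the raw events stay in the archive and can be reprocessed if the summary definition changes.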

The good news is that lots of people have solved these issues before so there are plenty of resources and tools available as your company scales.

It is also worth noting that building, maintaining, and scaling a data infrastructure can be a full-time job: you might spend many months working on it and not get to do much analysis. Try to set things up so that you are not the only one who "owns" the data infrastructure.



