Ask DT: How do I start building a data infrastructure at a startup?
8 points by I_am_lost 3125 days ago | 5 comments
Hi all, I am in a bit of a pickle, I'm afraid. A few months ago, I was hired to do Data Engineering for a small startup. When I came on board, I discovered that the data infrastructure consists of undocumented bash scripts and large flat files (mostly CSV) on a server. I only have one year of experience in engineering and I am a bit lost. Because of the lack of infrastructure, most of my time is spent extracting and cleaning data, and I often barely have any time to do modelling or any sort of 'analysis'. I would like to build out a good infrastructure for them, but my problem is that there are too many unknowns and I am not experienced enough. My immediate superior does not have a background in data science and is happy with flat files (which I personally don't agree with). Can someone share some stories from the trenches and help me out? I'd like to see this as an opportunity to challenge myself and grow, but I'm afraid that my newbieness will do more harm than good. Thank you.


3 points by skadamat 3124 days ago | link

Hey, you'll have to do some trial and error, but the best thing is to read some war-story posts:

500px analytics guy: https://medium.com/@samson_hu/building-analytics-at-500px-92...

Insight's data eng blog - http://insightdataengineering.com/blog/

Search on Google for the phrase "Building data pipelines":

http://www.slideshare.net/g33ktalk/data-pipeline-acial-lyceu...

https://metamarkets.com/2014/building-a-data-pipeline-that-h...

http://www.bluedata.com/blog/2015/06/apache-spark-how-to-get...

http://highscalability.com/blog/2015/6/8/leveraging-aws-to-b...

-----

1 point by I_am_lost 3124 days ago | link

Thanks a lot for the links! I'll bookmark them for in-depth study.

-----

1 point by jayhack 3123 days ago | link

Get Splunk and dump everything into it. Especially if you have messy data, it's a good place to start, and it will take care of most of the integration for you.
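
A rough sketch of what "dumping" a record into Splunk could look like via its HTTP Event Collector (HEC); the host, token, and payload below are hypothetical placeholders for your own setup:

    import requests  # third-party: pip install requests

    # Hypothetical HEC endpoint and token -- substitute your own.
    SPLUNK_URL = "https://splunk.example.com:8088/services/collector/event"
    SPLUNK_TOKEN = "00000000-0000-0000-0000-000000000000"

    def send_event(event):
        """Ship one record to Splunk over the HTTP Event Collector."""
        resp = requests.post(
            SPLUNK_URL,
            headers={"Authorization": "Splunk " + SPLUNK_TOKEN},
            json={"event": event, "sourcetype": "_json"},
        )
        resp.raise_for_status()

    send_event({"user_id": 42, "action": "signup"})

Point your messy bash scripts at something like send_event and Splunk handles indexing and search from there.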

-----

1 point by brian_spiering 3124 days ago | link

Focus on the fundamentals:

1) Redundancy: everything needs to be backed up, preferably offsite. Any cloud service provider is good for this.

2) Version control: all those scripts need to be in a DVCS (distributed version control system). GitHub is good for this.

3) An architecture diagram and a plan: document where you are and where you want to be. That gives you something concrete to discuss. Define the makers (producers) and users (subscribers) of the data. This doesn't have to be formal or perfect. I have frequently found that people hold different implicit assumptions that are at odds with each other, and that the users of the data aren't getting data they could use. Externalizing these assumptions helps resolve that tension.

4) Get a budget, even if it is just an order of magnitude for how much can be spent in time, people, and money.

5) Don't over-engineer or throw the newest technology at the problem. Start with the simplest (non-sexy) systems. You probably don't need Spark and friends. I would guess that an RDBMS is going to be very helpful very soon; see the sketch below.
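
To make point 5 concrete, here is a minimal sketch of loading one of those flat files into SQLite, the simplest possible RDBMS, using only the Python standard library (the file name and columns are hypothetical):

    import csv
    import sqlite3

    # Hypothetical CSV with columns user_id, action, ts --
    # substitute one of your real flat files.
    conn = sqlite3.connect("warehouse.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id TEXT, action TEXT, ts TEXT)"
    )

    with open("events.csv", newline="") as f:
        reader = csv.DictReader(f)
        conn.executemany(
            "INSERT INTO events VALUES (:user_id, :action, :ts)",
            reader,
        )

    conn.commit()
    conn.close()

Once the data is in a table, the extraction and cleaning that eats your time becomes a SQL query instead of a bash one-liner.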

If this is truly beyond you and the team, hire help. A little bit of consulting will go a long way.

You have a fun and interesting challenge!

-----

1 point by patientfrog 3124 days ago | link

Not sure about a lot of the details of what you are dealing with, so of course take my response with a big grain of salt. Still, from what it sounds like, you are in a good place: you may make some mistakes, but it will be a great learning experience.

There are many ways you might want to approach it, and Netflix's or Google's use cases might be very different from your company's (scale issues aside). The first thing to do is to have a conversation with the team about what the needs are, both now and in the future.

That said, a good, general thing to do with your data, no matter what it looks like, is to put it somewhere you would call an archive. My suggestion is S3, since it is cheap (likely cheaper than using an EC2 instance as a storage device), easy to use (i.e. low maintenance), and plenty of analytics tools, like Spark or even EMR, can read from S3 out of the box. When things get too big or expensive, it might be time to move into HDFS or another more mature file storage system.
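
A minimal sketch of that archiving step, assuming boto3 is installed and AWS credentials are configured (the bucket name and key layout are hypothetical):

    import boto3  # third-party: pip install boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket; partitioning raw files by date in the key
    # lets downstream tools (Spark, EMR) prune what they read.
    s3.upload_file(
        "events.csv",                        # local flat file
        "my-startup-data-archive",           # S3 bucket (hypothetical)
        "raw/events/2016/06/01/events.csv",  # date-partitioned key
    )

For one-off bulk moves, the AWS CLI's "aws s3 sync" does the same job with no code at all.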

As you get a better sense of which queries and ETL jobs are common, you might consider a database to store pre-processed data. AWS has lots of options to help you get started; they are expensive, but they save time if you need to solve problems fast.
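
For example, here is a minimal sketch of pushing one pre-aggregated result into a managed Postgres instance (e.g. RDS), using the common psycopg2 driver; the endpoint, credentials, and table are hypothetical:

    import psycopg2  # third-party: pip install psycopg2

    # Hypothetical RDS endpoint and credentials -- substitute your own.
    conn = psycopg2.connect(
        host="mydb.abc123.us-east-1.rds.amazonaws.com",
        dbname="analytics",
        user="etl",
        password="secret",
    )
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS daily_actions (day DATE, action TEXT, n INT)"
    )
    # The output of a nightly ETL job: analysts query this small table
    # instead of re-scanning the raw CSVs.
    cur.execute(
        "INSERT INTO daily_actions VALUES (%s, %s, %s)",
        ("2016-06-01", "signup", 1234),
    )
    conn.commit()
    conn.close()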

The good news is that lots of people have solved these issues before, so there are plenty of resources and tools available as your company scales.

It is also worth noting that building, maintaining, and scaling a data infrastructure can be a full-time job: you might spend many months working on it and not get to do much analysis. Try to set things up so that you are not the only one who "owns" the data infrastructure.

-----



