DataTau
1 point by binalpatel 3323 days ago | link | parent

An EC2 server spins up every morning and pulls data from several APIs, internal sources, etc. The data is saved as flat files in S3 and then loaded into Redshift via COPY. ETL is done with a combination of Pentaho Spoon (to create flows, i.e. do this, then this, then this) and Python scripts (which do the bulk of the downloading and processing).
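The S3-then-COPY step above could be sketched roughly like this. This is a minimal illustration, not the actual pipeline: the table, bucket, key, and IAM role names are all hypothetical, and the real scripts would use boto/psycopg2 to do the upload and run the SQL.

```python
import io


def rows_to_flatfile(rows, delimiter="|"):
    """Serialize rows (lists of fields) into the pipe-delimited
    flat-file format that would be uploaded to S3 before the COPY."""
    buf = io.StringIO()
    for row in rows:
        buf.write(delimiter.join(str(field) for field in row) + "\n")
    return buf.getvalue()


def build_copy_statement(table, bucket, key, iam_role):
    """Build the Redshift COPY command that bulk-loads a flat file
    from S3. All identifiers here are placeholder examples."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER '|' GZIP;"
    )


# Example: serialize one day's pull, then generate the load command.
flatfile = rows_to_flatfile([["2015-01-01", "api_a", 42]])
sql = build_copy_statement(
    "events", "my-etl-bucket", "daily/events.gz",
    "arn:aws:iam::123456789012:role/redshift-copy",
)
```

COPY is used rather than row-by-row INSERTs because Redshift parallelizes the load across slices, which is far faster for daily batch volumes.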

A monthly load of a very large dataset also runs, using Elastic MapReduce and mrjob/Pig to process and clean the data.
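The kind of cleaning job that runs under mrjob on EMR can be sketched in plain Python as a mapper/reducer pair. The field layout and cleaning rules below are invented for illustration; a real mrjob job would subclass MRJob and let Hadoop handle the shuffle that run() simulates here.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Parse one raw record, normalize it, and drop malformed rows.
    The user_id|event layout is a hypothetical example."""
    parts = line.strip().split("|")
    if len(parts) != 2 or not parts[0]:
        return  # skip lines that don't match the expected layout
    user_id, event = parts
    yield user_id, event.lower()


def reducer(key, values):
    """Deduplicate and sort the events seen for one user."""
    yield key, sorted(set(values))


def run(lines):
    """Drive map and reduce locally, sorting mapper output by key
    the way the Hadoop shuffle would between phases."""
    mapped = [kv for line in lines for kv in mapper(line)]
    mapped.sort(key=itemgetter(0))
    result = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_values in reducer(key, [v for _, v in group]):
            result[out_key] = out_values
    return result


cleaned = run(["1|Click", "1|click", "2|View", "bad line"])
```

Splitting the work into a stateless mapper and a per-key reducer is what lets EMR spread a very large monthly dataset across many machines.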



