DataTau
Ask DT: What does your data warehouse look like?
3 points by cmrn_dp 3315 days ago | 4 comments
I'm curious about what your data warehouse looks like, including schema designs, tech stack, etc.

Shopify:

We follow a strict star schema design pattern. Our data is stored in HDFS; we use PySpark for ETL and store the resulting datasets in Amazon Redshift.
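For anyone unfamiliar with the star schema idea, here is a minimal sketch of the split into a dimension table and a fact table. It is plain Python rather than PySpark so it runs anywhere, and it is not our actual pipeline: the `build_star` function, the column names, and the sample rows are all made up for illustration.

```python
from typing import Dict, List, Tuple


def build_star(orders: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split denormalized order rows into a customer dimension and an order fact table.

    Hypothetical example: column names are assumptions, not a real schema.
    """
    customer_keys: Dict[str, int] = {}  # natural key (email) -> surrogate key
    dim_customer: List[dict] = []
    fact_orders: List[dict] = []

    for row in orders:
        email = row["customer_email"]
        # Assign a surrogate key the first time a customer is seen.
        if email not in customer_keys:
            customer_keys[email] = len(customer_keys) + 1
            dim_customer.append({
                "customer_key": customer_keys[email],
                "email": email,
                "country": row["country"],
            })
        # The fact row keeps only measures plus a foreign key to the dimension.
        fact_orders.append({
            "order_id": row["order_id"],
            "customer_key": customer_keys[email],
            "amount": row["amount"],
        })

    return dim_customer, fact_orders


# Made-up sample data.
orders = [
    {"order_id": 1, "customer_email": "a@example.com", "country": "CA", "amount": 20.0},
    {"order_id": 2, "customer_email": "b@example.com", "country": "US", "amount": 35.5},
    {"order_id": 3, "customer_email": "a@example.com", "country": "CA", "amount": 12.0},
]
dim_customer, fact_orders = build_star(orders)
```

Descriptive attributes live once in the dimension, and the fact table stays narrow, so analysts join on `customer_key` when they need the attributes.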



3 points by achompas 3314 days ago | link

No good answer for the warehouse at my current gig, but at my last gig we loaded data from Cassandra into Redshift (RS) by running Hadoop ETL over backup SSTables in S3. The SSTables were backed up to S3 using Netflix Priam.

-----

2 points by larrydag 3312 days ago | link

I'm curious. Do Data Scientists really care? If I were an owner of a business I suppose I would care to some degree. But as a Data Scientist in an organization I'm just a user/customer of the team that produces the data warehouse. I just need the data to do analysis and solve problems.

-----

1 point by binalpatel 3312 days ago | link

I think it depends on the size of the org. In my case I built the warehouse because there was no in-house team doing it, and I needed to get quality data gathered in one place. For me, there's not really a clear delineation between data engineering and data science.

-----

1 point by binalpatel 3313 days ago | link

An EC2 server spins up every morning and pulls data from several APIs, internal sources, etc. Data is saved as flat files into S3 and then loaded into Redshift via COPY. ETL is done with a combination of Pentaho Spoon (to create flows, i.e. do this, then this, then this) and Python scripts (which do the bulk of the downloading and processing).

A monthly load of a very large dataset also occurs, using Elastic MapReduce and mrjob/Pig to process and clean the data.
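To make the S3-to-Redshift step from the daily load concrete, here is a rough sketch of how a Python script could assemble the COPY statement. The `build_copy_statement` helper, the table name, the bucket path, and the role ARN are all hypothetical placeholders, not my real setup, and actually running the statement (e.g. through a database driver) is left out.

```python
def build_copy_statement(table, s3_path, iam_role,
                         delimiter=",", ignore_header=1, gzip=False):
    """Assemble a Redshift COPY statement for delimited flat files in S3.

    Hypothetical helper for illustration. The values are interpolated
    directly, so only pass trusted, hard-coded arguments.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role}'",
        f"DELIMITER '{delimiter}'",
        f"IGNOREHEADER {ignore_header}",
    ]
    if gzip:
        # Tell Redshift the flat files are gzip-compressed.
        parts.append("GZIP")
    return "\n".join(parts) + ";"


# Placeholder table, bucket, and role; substitute your own.
sql = build_copy_statement(
    table="events",
    s3_path="s3://example-bucket/daily/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRole",
    gzip=True,
)
print(sql)
```

Pointing COPY at an S3 prefix rather than a single file lets Redshift load every flat file under that prefix in one statement.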

-----



