Which Tool To Use For Your Data Pipelines? (jeannicholashould.com)
11 points by nickhould 2836 days ago | 7 comments


3 points by mfcabrera 2835 days ago | link

What about Luigi? For a serious comparison: http://bytepawn.com/luigi-airflow-pinball.html

-----

2 points by nickhould 2834 days ago | link

I have never used Luigi. It's been around for a while and is still actively maintained, with a great community around it. We considered it in our selection process but preferred Airflow because it includes a scheduler. The author of that comparison would also go with Airflow.

-----

2 points by tomkinstinch 2835 days ago | link

Where I work, we've been using Snakemake[1] for data pipelines. It's like GNU Make but with a Pythonic syntax. It determines which processing steps need to be executed by building a directed acyclic graph from the end to the beginning: it figures out which operations are needed to produce a given output, then looks at those and figures out their inputs, and so on. It can even submit parallel processing jobs to a batch-queueing cluster.

1. https://bitbucket.org/snakemake/snakemake/wiki/Home
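
To make the "end to the beginning" idea concrete, here is a minimal sketch in plain Python. It is not Snakemake's syntax or API; the file names, the RULES table, and the plan() helper are all made up for illustration. It assumes the rules form a DAG (no cycle detection), and it only checks whether a file already exists, not whether it is out of date.

```python
# Hypothetical rules: each output file maps to the inputs needed to build it.
RULES = {
    "report.html": ["stats.csv", "plot.png"],
    "plot.png": ["stats.csv"],
    "stats.csv": ["raw.csv"],
}


def plan(target, existing, rules=RULES):
    """Walk backwards from the target and return the outputs to build,
    ordered so that every input is built before the step that needs it."""
    order = []
    seen = set()

    def visit(output):
        # Skip anything already planned or already present on disk.
        if output in seen or output in existing:
            return
        seen.add(output)
        if output not in rules:
            raise ValueError(f"no rule produces {output!r}")
        # Resolve the inputs first, then schedule this output.
        for dep in rules[output]:
            visit(dep)
        order.append(output)

    visit(target)
    return order


print(plan("report.html", existing={"raw.csv"}))
# ['stats.csv', 'plot.png', 'report.html']
```

If stats.csv already exists as well, the same call only plans plot.png and report.html, which is the "only run what is needed" behaviour described above.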

-----

2 points by nickhould 2834 days ago | link

Interesting. Does it handle the scheduling?

Airflow uses directed acyclic graphs (DAGs) too. You can visualize those graphs and build task sequences from them.
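
For anyone unfamiliar with the term, a rough sketch of what "a sequence built from a DAG" means, using only Python's standard library (graphlib, Python 3.9+). This is not Airflow's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# A topological order runs every task after the tasks it depends on.
order = list(TopologicalSorter(dag).static_order())
print(order)
# ['extract', 'transform', 'load']
```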

-----

1 point by LevonK 2833 days ago | link

https://github.com/azkaban/azkaban has been working really well for us. We POC'd Luigi, Jenkins, Rundeck, Oozie & Chronos.

Azkaban was by far the most appropriate. We're investigating coupling it with Gobblin and https://github.com/linkedin/WhereHows which has native integration.

Apache NiFi looks interesting as well, but we haven't looked at it yet.

-----

1 point by gps13 2835 days ago | link

Hey Jean-Nicholas, great post. You could try out https://www.blendo.co/ too. If you like it (I am sure you will) you may add it to the list too ;) (disclaimer: I am one of its co-founders)

-----

1 point by nickhould 2834 days ago | link

There's quite a time investment to "try" those solutions :). How does it compare with RJ Metrics Pipeline and Segment Sources?

-----



