DataTau
1 point by codingvc 3356 days ago

Agreed. Data aggregation is very hard. Some common problems include: those who have the data don't want to share or sell it; data is open but only in undocumented/hard-to-use formats; data is available but messy and heterogeneous; and data is clean but there are lots of different sources that need to be unified (and unification is very hard).

A lot of data scientists spend way more time cleaning, joining, or deduping data than they spend building analyses and models. It's frustrating. Fortunately, there are more and more tools like Trifacta and OpenRefine that make those tasks easier.



2 points by luckymethod 3356 days ago

OpenRefine is essentially abandonware, and Trifacta requires a fat wallet. This problem is still VERY much waiting for a good everyday solution.

-----

1 point by codingvc 3356 days ago

There's definitely room for more tooling! Re: OpenRefine -- I'm not sure if it's still evolving much, but the last time I used it, it still saved me a bunch of time.

-----

2 points by luckymethod 3356 days ago

The last commit was in 2011, and it lost the ability to use Freebase. Very sad.

-----



