2 points by kiyoto 3357 days ago | link | parent

Your post struck a chord with me. Acquiring data is a huge part of one's job as a data analyst/scientist, but there aren't nearly enough tools or resources on "practical data collection."

For example, I recently became curious about Product Hunt and its growth. What I ended up spending most of my time on (before plotting pretty charts with ggplot) was reverse-engineering Product Hunt's API to download a bunch of data. Stuff like this is never explicitly taught, but it's hugely valuable if you want to use data as a decision-informing tool.
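For the curious, the workflow is roughly the sketch below: page through a JSON endpoint and accumulate the records. The URL, headers, and pagination parameters are made-up placeholders for illustration, not Product Hunt's actual API.

    # Minimal sketch of pulling paginated JSON from an undocumented endpoint.
    # The endpoint and parameters below are hypothetical placeholders.
    import requests

    BASE_URL = "https://example.com/api/posts"   # placeholder endpoint
    HEADERS = {"User-Agent": "research-script/0.1"}

    def fetch_all(pages=10):
        records = []
        for page in range(1, pages + 1):
            resp = requests.get(BASE_URL, headers=HEADERS, params={"page": page})
            resp.raise_for_status()
            records.extend(resp.json())          # assumes each page returns a JSON list
        return records

    if __name__ == "__main__":
        posts = fetch_all()
        print(f"collected {len(posts)} records")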



1 point by codingvc 3356 days ago | link

Agreed. Data aggregation is very hard. Some common problems: data owners who don't want to share or sell their data; data that's open but only in undocumented, hard-to-use formats; data that's available but messy and heterogeneous; and data that's clean but spread across many sources that need to be unified (and unification is very hard).

A lot of data scientists spend far more time cleaning, joining, and deduping data than building analyses and models. It's frustrating. Fortunately, there are more and more tools like Trifacta and OpenRefine that make those tasks easier.
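As a minimal sketch of that clean/dedupe/unify grind in pandas (file names and columns here are made up for illustration):

    # Clean a shared key, drop duplicates, then unify two sources on it.
    import pandas as pd

    a = pd.read_csv("source_a.csv")   # e.g. columns: name, email, revenue
    b = pd.read_csv("source_b.csv")   # e.g. columns: company, email, employees

    # Normalize the join key before matching.
    for df in (a, b):
        df["email"] = df["email"].str.strip().str.lower()

    # Drop exact duplicates on the key, keeping the first occurrence.
    a = a.drop_duplicates(subset="email")
    b = b.drop_duplicates(subset="email")

    # Unify the two sources on the shared key.
    merged = a.merge(b, on="email", how="outer")
    print(merged.head())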

-----

2 points by luckymethod 3356 days ago | link

OpenRefine is essentially abandonware, and Trifacta requires a fat wallet. This problem is still VERY much waiting for a good everyday solution.

-----

1 point by codingvc 3356 days ago | link

There's definitely room for more tooling! Re: OpenRefine -- I'm not sure if it's still evolving much, but the last time I used it, it still saved me a bunch of time.

-----

2 points by luckymethod 3356 days ago | link

The last commit was in 2011, and it lost the ability to use Freebase. Very sad.

-----



