The Value of Data, Part 1: Using Data as a Competitive Advantage (codingvc.com)
10 points by codingvc 3330 days ago | 8 comments


4 points by codingvc 3330 days ago | link

(I'm the author of the blog post.)

I'm not really a data scientist, more of an engineer who has worked on a lot of data projects. I'd love to know what the data science community thinks of my recent post on using data as a competitive advantage. Are there things that I missed or got wrong?

-----

2 points by luckymethod 3324 days ago | link

I agree with most of your post, but I still think that most of the "data science advantage" you assume drives the success of services like Google and Netflix is just good old network effect at work.

Example: the recommendation engine in Netflix is mediocre at best, and in my 3+ years as a user I have never gotten a useful suggestion out of it. What they do have is Breaking Bad, House of Cards and a lot of other movies I wanted to watch. If they lost their good content, I wouldn't think twice about canceling, data or no data.

-----

1 point by codingvc 3324 days ago | link

I think you're right that data is not the only moat, but it definitely helps. I think it's also important to separate your personal preferences from those of the general public (something I've learned and re-learned as a VC). That is, you might go to Netflix for specific/great content -- and I do, too -- but lots of people will just watch whatever Netflix recommends because the recs are good.

Two relevant articles:

- http://streamdaily.tv/2014/10/10/netflixs-data-engine-worth-... -- Netflix's chief product officer gave a $500m/year ballpark estimate of how much the recommendation engine is worth to Netflix. (This would be about 10% of Netflix's revenue.)

- http://techblog.netflix.com/2012/04/netflix-recommendations-... -- "We have adapted our personalization algorithms to this new scenario in such a way that now 75% of what people watch is from some sort of recommendation."

-----

2 points by kiyoto 3325 days ago | link

Your post struck a chord with me. Acquiring data is a huge part of one's job as a data analyst/scientist, but there aren't nearly enough tools or resources on "practical data collection."

For example, I recently became curious about Product Hunt and its growth. What I ended up spending most of my time on (before plotting pretty charts with ggplot) was reverse-engineering Product Hunt's API to download a bunch of data. Stuff like this is never explicitly taught but is hugely valuable if you want to use data as your decision-informing tool.

-----

1 point by codingvc 3324 days ago | link

Agreed. Data aggregation is very hard. Some common problems include: those who have the data don't want to share or sell it; data is open but only in undocumented/hard-to-use formats; data is available but messy and heterogeneous; and data is clean but there are lots of different sources that need to be unified (and unification is very hard).

A lot of data scientists spend way more time on cleaning or joining or deduping data than they spend building analyses and models. It's frustrating. Fortunately there are more and more tools like Trifacta and OpenRefine that make those tasks easier.
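To make the "cleaning or joining or deduping" step concrete, here is a minimal Python sketch of unifying records from two sources. Everything in it is made up for illustration: the record fields, the sample companies, and the helper names `normalize_name` and `unify` are all hypothetical, and real data would need far more careful matching than a normalized-name key.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def unify(sources):
    """Merge records from several sources, keyed on the normalized name.

    The first non-empty value seen for a field wins; later sources only
    fill in fields that earlier ones left empty.
    """
    merged = {}
    for records in sources:
        for record in records:
            key = normalize_name(record["name"])
            target = merged.setdefault(key, {})
            for field, value in record.items():
                if value in (None, "") or field in target:
                    continue  # skip empty values and already-filled fields
                target[field] = value.strip() if isinstance(value, str) else value
    return list(merged.values())


# Hypothetical sample data: the same company spelled two different ways.
source_a = [{"name": "Acme, Inc.", "city": "Boston", "employees": None}]
source_b = [
    {"name": "  acme inc ", "city": "", "employees": 120},
    {"name": "Globex", "city": "Springfield", "employees": 50},
]

unified = unify([source_a, source_b])
for row in unified:
    print(row)
```

Even in this toy case the matching rule is a judgment call (is "Acme Inc" the same entity as "Acme Incorporated"?), which is a big part of why unification eats so much time.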

-----

2 points by luckymethod 3324 days ago | link

OpenRefine is essentially abandonware, and Trifacta requires a fat wallet. This problem is still VERY much waiting for a good everyday solution.

-----

1 point by codingvc 3324 days ago | link

There's definitely room for more tooling! Re: OpenRefine -- I'm not sure if it's still evolving much, but the last time I used it, it still saved me a bunch of time.

-----

2 points by luckymethod 3324 days ago | link

The last commit was in 2011, and it lost the ability to use Freebase. Very sad.

-----



