DataTau
No, you're not a data scientist (nodejitsu.com)
12 points by Quietlike 3796 days ago | 20 comments


12 points by nofreehunch 3795 days ago | link

Data kiddies like me are coming.

I just ran multiple passes of the Broyden–Fletcher–Goldfarb–Shanno algorithm with a 100-layer neural network on a tf-idf-vectorized dataset. I have no clue what all of that means exactly; all I know is that it took under an hour and scored a top-10% AUC.
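For the curious, that kind of pipeline can be sketched with scikit-learn, whose MLPClassifier exposes an L-BFGS solver (a limited-memory variant of BFGS). The toy corpus, layer sizes, and parameters below are invented for illustration; nothing here reproduces the actual competition setup, and a literal 100-layer network would be impractical to train this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Invented toy corpus; any labelled text dataset would do.
texts = ["good great fun", "awful terrible bad", "great movie loved it",
         "bad plot terrible acting", "fun and great", "terrible and bad"] * 20
labels = [1, 0, 1, 0, 1, 0] * 20

# tf-idf vectorization, as described in the comment.
X = TfidfVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=0)

# A small neural network trained with the L-BFGS solver, scored by AUC.
clf = MLPClassifier(hidden_layer_sizes=(16, 16), solver="lbfgs",
                    max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

On real competition data the same handful of lines applies; only the corpus and the evaluation change.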

Kaggler amateurs are beating the academics by brute force or smarter use of the many tools that are currently freely available.

Show a regular Python dev some examples and library docs and she can compete in ML competitions.

I was getting good results with LibSVM before I even had a surface-level understanding of how SVMs work. Feed it the correct input format and some parameters and you are good to go. Random Forests can be applied to nearly anything and get you 75%+ accuracy.
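As a sketch of that black-box workflow (not the commenter's actual code), here is a Random Forest with default parameters on a dataset bundled with scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# No feature engineering, no tuning: feed the data in and read off a score.
X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean())
```

On this particular dataset the untuned forest lands well above the 75% ballpark mentioned above; harder problems need more care.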

Maybe I am just an engineer looking for pragmatic, practical uses of techniques from ML and data science. Hard data scientists will be the statisticians, the algorithmic theory experts, the experimental physicists. It takes me 7 years to understand a complex mathematical paper. It takes me 7 minutes to train a model and predict on a 1-million-row test set with Vowpal Wabbit.

-----

5 points by j2kun 3795 days ago | link

I think your analogy with "script kiddies" from the hacker world is apt.

-----

3 points by Quietlike 3795 days ago | link

Wow! I would love to see a write-up from you about this. Seriously, your comment is one of the most intriguing things I've read in a while.

I'm pretty ignorant about the context of what you did, and would love to hear more about it.

-----

3 points by roycoding 3795 days ago | link

"Kaggler amateurs are beating the academics by brute force or smarter use of the many tools that are currently freely available."

The Kaggle forums are a great place to pick up practical machine learning knowledge, especially the post-competition discussions by top finishers.

-----

2 points by Malarkey 3795 days ago | link

Just for you, I've posted the Simply Statistics unconference on the future of stats. You may be interested in the discussion of "inference" as opposed to prediction.

-----

1 point by zmjjmz 3794 days ago | link

I think that trying out black-box algorithms and then figuring out what the hell they actually do (if they work) isn't a bad strategy (a sort of lazy learning of machine learning).

I personally have taken a theory-based ML class and understand most of the theoretical underpinnings of complex models (and, more importantly, simple models), but we couldn't possibly have covered everything in a single semester. The theory we covered rarely dealt with specific models, though; it was more about overfitting, regularization, data snooping, and the like.

I think that while it might be important to understand the meanings of different models, it's more important that you understand the theory that applies to all of the models, namely how to properly work with your data to prevent overfitting and how to properly test your models to show generalization error. Everything else is essentially hyperparameter tuning.
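A minimal sketch of that discipline, assuming scikit-learn (the dataset and parameter grid are arbitrary choices, not anything from the thread): tune hyperparameters by cross-validation on the training portion only, and report generalization error on a held-out test set the tuning never saw.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set before any tuning happens.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hyperparameter search via 5-fold cross-validation on the training data only.
pipe = make_pipeline(StandardScaler(), SVC())
search = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=5)
search.fit(X_tr, y_tr)

# The held-out score is the honest estimate of generalization error.
print(search.score(X_te, y_te))
```

Peeking at the test set during tuning, by contrast, is exactly the data snooping the theory courses warn about.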

-----

1 point by Malarkey 3794 days ago | link

I agree with every word you say. Now that I have a moment (thx Santa), I will just add that most scientists - and many data scientists - are not primarily (or at least not only) interested in Kaggle-style best prediction performance. They are fitting models because they are interested in understanding the system and the features of the model.

So if you have a linear model that fits some data with 90% accuracy, this is often more useful than a neural network that predicts at 92% - because the GLM gives you coefficients, and maybe regularisation or feature selection - something you can interpret or describe to others.
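To make that concrete with an invented setup (not the commenter's data): a regularised logistic regression, a GLM, hands back one coefficient per feature that you can rank, sign, and explain to others.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# An L2-regularised GLM; standardising first makes coefficients comparable.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(data.data, data.target)

# One coefficient per feature: sign and magnitude are directly interpretable,
# which is the payoff a black-box model does not give you.
coefs = model.named_steps["logisticregression"].coef_[0]
for name, c in sorted(zip(data.feature_names, coefs),
                      key=lambda t: -abs(t[1]))[:3]:
    print(f"{name}: {c:+.2f}")
```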

Further, whilst there have been great strides in deep learning, SVMs, and ensemble methods, it is often the case that complex models are fit very particularly to the dataset. Not overfit in the sense that they perform poorly on the set-aside test set, but rather that they may perform poorly if another dataset is obtained by different people, at a different point in time, with different equipment, or from a different population. This is a well-known phenomenon in science/medicine which makes people wary of pushing for the absolute maximum best-tuned Kaggle-style predictor.

Now I'm not knocking deep learning or similar (as an ex-PhD in visual neuroscience I find it thrilling). These methods may well revolutionise speech recognition, self-driving cars, image identification, etc. Yet such systems are a particular type of AI where you just want the system to, say, recognise a puppy or avoid a lamppost - not output a general law of puppies or lampposts. Indeed, much of their power seems to arise from them abstracting what puppies or lampposts are. These systems are amazing.

But you shouldn't confuse them with the other class of data science or just science which is about summarising (not abstracting) a complex system in a way you can understand.

...so, tl;dr: black-box methods are fascinating and great for a whole class of problems, but for many data science or science applications they aren't necessary, may overfit, require a lot of tuning effort, or aren't very informative (in the colloquial sense of the word).

-----

4 points by mpearce 3794 days ago | link

I'm surprised nobody's commented on that 'road map' yet. It's basically a selection of keywords strung together in a disordered homage to Harry Beck.

-----

3 points by achompas 3794 days ago | link

Definitely a disorganized list. Some concepts are summed up in a single stats class, others are entire fields of research.

-----

1 point by blob_dillen 3793 days ago | link

Right? With a whole mess of overlap between tools and methods. It's cool, don't get me wrong, but... it's not the be-all and end-all of data science.

-----

3 points by cja23 3794 days ago | link

I sympathize with the author's concern that the term "data scientist" has become diluted and nearly meaningless, and so quickly after it was first coined.

As pointed out by some of the other comments, the term "scientist" isn't even applicable to a lot of this. A scientist studies, experiments, and learns something about their subject. Most of the job listings and communities using the term "data scientist" really mean something closer to "data engineer" or "data analyst", but of course those don't have the sexy implications that "scientist" does, at least until the hype cycle gets done with it.

Bottom line: there's lots to learn and lots to do as the world continues to wake up to "data" as a first class asset worth paying attention to. Don't get too hung up on what titles everyone has or wants, they will continue to change. Focus on the jobs and tasks that the titles point to.

-----

3 points by achompas 3794 days ago | link

I think this post doesn't attack the right issue.

The question isn't whether so-and-so is a data scientist. Most effective data teams have a combination of people like the author, statisticians, ops people etc. Harlan, Marck, and Sean's "Analyzing the Analyzers" matches my experience; if you're thinking about this stuff at all, chances are you fall into one of the categories described in that paper.

nofreehunch talks about using BFGS for neural network parameter estimation, and "[has] no clue what that all exactly means." If that's the case, this form of black box analysis will be commoditized in the future.

I see the Data Science Master's as another catalyst for this commodification. EoSL, Ng's Coursera course, and an O'Reilly book do not even begin to cover the breadth of ML topics, missing things like non-trivial neural networks and reinforcement learning. The single statistics book is a poorly-rated O'Reilly text that doesn't address Bayesian statistics, which is bad bad bad given that we use things like MCMC and variational techniques all the time at work.
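For readers who haven't met MCMC, here is a deliberately tiny Metropolis sampler for the posterior mean of a normal model with a flat prior. The data, step size, and chain length are all made up for illustration, and real work would reach for a library such as PyMC rather than this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=200)  # synthetic observations, known sd = 1

def log_post(mu):
    # Flat prior on mu, Normal(mu, 1) likelihood, constants dropped.
    return -0.5 * np.sum((data - mu) ** 2)

mu, samples = 0.0, []
for _ in range(5000):
    prop = mu + rng.normal(0.0, 0.3)   # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(mu):
        mu = prop                       # accept; otherwise keep current mu
    samples.append(mu)

# After burn-in, the chain mean approximates the posterior mean,
# which for a flat prior is just the sample mean of the data.
print(np.mean(samples[1000:]))
```

Even this toy hides semester-sized questions (burn-in, step size, convergence diagnostics), which is rather the point being made above.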

These topics have fractal complexity, just like many programming problems, but everything in the curriculum is a topical overview. The roadmap is similarly deceptive: stops on the map like "neural networks" and "bias & variance" are actually super-complex, deserving of semester-long graduate courses.

Be very, very skeptical of these posts. This curriculum grants you basic understanding, but fluency takes much longer.

-----

2 points by sho 3794 days ago | link

I've always found the title Data Janitor to be the most honest way of describing 90% of what I spend my time doing.

-----

2 points by earlbellinger 3795 days ago | link

Not quite sure what the point of this post is. It looks like it is trying to shame the reader into taking online courses?

-----

1 point by Quietlike 3795 days ago | link

I think it's meant more as a counter to the idea that learning a technology (for example, Hadoop) makes someone a data scientist.

I think the people on this forum are educated enough to know this is not the case, but you would be shocked at the percentage of people (egged on by marketing) who think that the first step to data science is tools, tools, tools.

-----

2 points by j2kun 3795 days ago | link

Knowing your favorite menagerie of tools, languages, database engines, libraries, and fashionable frameworks does not make you a data scientist. In fact, "knowing" things doesn't make you any kind of scientist.

Real data scientists develop new models, design new frameworks, and apply their techniques to problems that aren't solved well by black-box algorithms. Data scientists create algorithms, design and carry out statistical tests, experiment, and refine. That's what makes it science.
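One small example of "design and carry out statistical tests" in that spirit: a hand-rolled one-sided permutation test on invented data, written out rather than pulled from a canned routine.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)   # control group (synthetic)
b = rng.normal(1.0, 1.0, 50)   # treatment group with a built-in effect

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

# Under the null hypothesis the group labels are exchangeable, so
# shuffling them generates the null distribution of the mean difference.
count = 0
for _ in range(10000):
    rng.shuffle(pooled)
    if pooled[50:].mean() - pooled[:50].mean() >= observed:
        count += 1

print(count / 10000)  # one-sided p-value
```

Designing the test, not just running it, is what makes the exercise science rather than tool use.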

I'm all for learning new stuff, but "data science" is clearly too loose a term.

-----

2 points by jcbozonier 3795 days ago | link

You're right that it's not about knowing, but learning. That's the real point of the scientific process. Science doesn't require the use of novel algorithms and can you imagine where we'd be if scientists were reticent to stand on the shoulders of others?

For myself, data science is about being able to come up with an approach to answer any question you might be asked about your data. It's about being able to quantify the too large and the qualitative and to make it stand alongside the excel spreadsheets and nice clean data you already have.

-----

3 points by earlbellinger 3795 days ago | link

The word "reticent" means "inclined to be silent". I think you're looking for "reluctant", which means "unwilling and hesitant; disinclined".

-----

1 point by jcbozonier 3794 days ago | link

Thank you for that!

-----

1 point by nofreehunch 3795 days ago | link

I had to Google that one too. I took it to mean "shy" and added a new word to my lexicon :).

-----



