1 point by Malarkey 3794 days ago | link | parent

I agree with every word you say. Now that I have a moment (thx Santa), I will just add that most scientists - and many data scientists - are not primarily (or at least not only) interested in Kaggle-style best prediction performance. They are fitting models because they are interested in understanding the system and the features of the model.

So a linear model (a GLM, say) that fits some data with 90% accuracy is often more useful than a neural network that predicts with 92% accuracy, because the GLM gives you coefficients, and maybe regularisation or feature selection - something you can interpret or describe to others.
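
To make the "coefficients you can interpret" point concrete, here is a minimal sketch in plain Python (no libraries, so it runs anywhere; in practice you would reach for a stats package). Everything in it is made up for illustration: the data, the feature names `signal` and `irrelevant`, and the helper functions. It fits a small logistic regression (one kind of GLM) by gradient descent and prints one weight per feature, which is the part you can read off and describe to others.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=0.5, steps=500):
    """Fit one weight per feature plus a bias by batch gradient descent."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(z) - y  # gradient of the log-loss w.r.t. z
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(rows, labels, weights, bias):
    hits = 0
    for x, y in zip(rows, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += int((1 if z > 0 else 0) == y)
    return hits / len(rows)


# Made-up data: the outcome is driven by "signal" (plus noise),
# while "irrelevant" has no effect on it at all.
rng = random.Random(0)
rows, labels = [], []
for _ in range(200):
    signal = rng.gauss(0, 1)
    irrelevant = rng.gauss(0, 1)
    rows.append([signal, irrelevant])
    labels.append(1 if signal + 0.5 * rng.gauss(0, 1) > 0 else 0)

weights, bias = fit_logistic(rows, labels)
train_accuracy = accuracy(rows, labels, weights, bias)

for name, w in zip(["signal", "irrelevant"], weights):
    print(f"{name:>10}: {w:+.2f}")
print(f"training accuracy: {train_accuracy:.2f}")
```

The accuracy will not be perfect because of the noise in the labels, but the two printed weights tell you directly which feature matters and in which direction. That is the kind of output a slightly more accurate black box does not hand you.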

Further, whilst there have been great strides in deep learning, SVMs and ensemble methods, complex models are often fit very particularly to the dataset. They are not overfit in the sense that they perform poorly on the set-aside test set; rather, they may perform poorly if another dataset is obtained by different people at a different point in time, with different equipment or from a different population. This is a well-known phenomenon in science and medicine, and it makes people wary of pushing for the absolute best-tuned, Kaggle-style predictor.
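
A toy sketch of that failure mode, under loudly stated assumptions: the "system" is an invented function (a linear trend plus a small wiggle), the "complex" model is stood in for by a 1-nearest-neighbour lookup that memorises the training points, and the "different population" is simply a range of x the training data never covered. None of this is a real dataset or anyone's actual method; it only shows how a model can win on the set-aside test set and still lose on data collected elsewhere.

```python
import math


def truth(x):
    # Hypothetical system: a linear trend plus a small wiggle.
    return 2.0 * x + 0.3 * math.sin(20.0 * x)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def nearest_neighbour(xs, ys, x):
    """Memorising model: return the y of the closest training x."""
    i = min(range(len(xs)), key=lambda k: abs(xs[k] - x))
    return ys[i]


def mean_abs_error(predict, xs):
    return sum(abs(predict(x) - truth(x)) for x in xs) / len(xs)


# Original study: x in [0, 1]. Held-out points come from the same range.
train_x = [i / 200 for i in range(201)]
train_y = [truth(x) for x in train_x]
heldout_x = [(i + 0.5) / 200 for i in range(200)]
# "Different population": x in [2, 3], never seen during fitting.
shifted_x = [2.0 + i / 200 for i in range(201)]

intercept, slope = fit_line(train_x, train_y)


def line(x):
    return intercept + slope * x


def memoriser(x):
    return nearest_neighbour(train_x, train_y, x)


errors = {
    name: {
        "held-out": mean_abs_error(model, heldout_x),
        "shifted": mean_abs_error(model, shifted_x),
    }
    for name, model in [("line", line), ("memoriser", memoriser)]
}

for name, result in errors.items():
    print(f"{name:>9}: held-out {result['held-out']:.3f}, "
          f"shifted {result['shifted']:.3f}")
```

On the held-out points the memoriser beats the straight line, because it also captures the wiggle. On the shifted points it does far worse, because it has nothing sensible to say outside the range it memorised, while the cruder line still tracks the trend.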

Now I'm not knocking deep learning or similar (as an ex-PhD in visual neuroscience I find it thrilling). These methods may well revolutionise speech recognition, self-driving cars, image identification etc. Yet such systems are a particular type of AI where you just want the system to, say, recognise a puppy or avoid a lamppost - not output a general law of puppies or lampposts. Indeed, much of their power seems to arise from abstracting what puppies or lampposts are. These systems are amazing.

But you shouldn't confuse them with the other class of data science (or just science), which is about summarising (not abstracting) a complex system in a way you can understand.

So, tl;dr: black box methods are fascinating and great for a whole class of problems, but for many data science or science applications they aren't necessary, may overfit, require a lot of tuning effort, or aren't very informative (in the colloquial sense of the word informative).



