12 points by nofreehunch 3781 days ago | link | parent

Data kiddies like me are coming.

I just ran multiple passes of the Broyden–Fletcher–Goldfarb–Shanno algorithm with a 100-layer neural network on a tfidf-vectorized dataset. I have no clue what all of that exactly means; all I know is that it took under an hour and gives a top-10% AUC score.
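
Roughly the kind of pipeline I mean, as a minimal sketch with scikit-learn (the corpus is a hypothetical toy stand-in, and sklearn's L-BFGS solver stands in for plain BFGS):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Hypothetical toy corpus standing in for the real dataset.
    docs = ["cheap pills now", "meeting at noon", "win money fast", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]

    # TF-IDF vectorize the raw text.
    X = TfidfVectorizer().fit_transform(docs)

    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5, stratify=labels)

    # L-BFGS-trained multilayer perceptron; the depth here is illustrative,
    # not the 100 layers mentioned above.
    clf = MLPClassifier(solver="lbfgs", hidden_layer_sizes=(64, 64), max_iter=500)
    clf.fit(X_tr, y_tr)

    print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))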

Amateur Kagglers are beating the academics by brute force, or by smarter use of the many tools that are now freely available.

Show a regular Python dev some examples and library docs and she can compete in ML competitions.

I was getting good results with LibSVM before I understood even superficially how SVMs work. Feed it the correct input format and some parameters and you are good to go. Random Forests can be applied to nearly anything and get you 75%+ accuracy.
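
For concreteness, a minimal sketch of what "feed it the input and some parameters" looks like with scikit-learn defaults (the bundled dataset is just a stand-in for whatever data you have):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Bundled dataset standing in for your actual data.
    X, y = load_breast_cancer(return_X_y=True)

    # Library defaults only: no theory needed for a usable baseline.
    for model in (SVC(), RandomForestClassifier()):
        scores = cross_val_score(model, X, y, cv=5)
        print(type(model).__name__, round(scores.mean(), 3))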

Maybe I am just an engineer looking for pragmatic and practical uses of techniques from ML and data science. The hard data scientists will be the statisticians, the algorithmic theory experts, the experimental physicists. It takes me 7 years to understand a complex mathematical paper. It takes me 7 minutes to train a model and predict on a 1-million-row test set with Vowpal Wabbit.
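
For the curious, the whole Vowpal Wabbit round trip is a couple of shell commands (the file names here are hypothetical; the input is VW's plain-text label | feature:value format):

    # lines in train.vw / test.vw look like:  1 | word_a:0.5 word_b:0.25
    vw -d train.vw --loss_function logistic -f model.vw   # train, save the model
    vw -d test.vw -t -i model.vw -p predictions.txt       # test-only pass, write predictions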



5 points by j2kun 3781 days ago | link

I think your analogy to "script kiddies" from the hacker world is very apt.

-----

3 points by Quietlike 3781 days ago | link

Wow! I would love to see a write-up from you about this. Seriously, your comment is one of the most intriguing things I've read in a while.

I'm pretty ignorant about the context of what you did, and would love to hear more about it.

-----

3 points by roycoding 3781 days ago | link

> Amateur Kagglers are beating the academics by brute force, or by smarter use of the many tools that are now freely available.

The Kaggle forums are a great place to pick up practical machine learning knowledge, especially the post-competition discussions by top finishers.

-----

2 points by Malarkey 3781 days ago | link

Just for you, I've posted the Simply Statistics unconference on the future of statistics. You may be interested in the discussion of "inference" as opposed to prediction.

-----

1 point by zmjjmz 3780 days ago | link

I think a mix of trying out black-box algorithms and then figuring out what the hell they actually do (if they work) isn't a bad strategy (a sort of lazy learning of machine learning).

I personally have taken a theory-based ML class and understand most of the theoretical underpinnings of complex models (and, more importantly, of simple models), but we couldn't possibly have covered everything in a single semester. The theory we covered rarely dealt with specific models, though; it was more about overfitting, regularization, data snooping, and the like.

I think that while it might be important to understand what the different models mean, it's more important to understand the theory that applies to all of them: how to properly work with your data to prevent overfitting, and how to properly test your models to estimate generalization error. Everything else is essentially hyperparameter tuning.
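
A minimal sketch of that separation, assuming scikit-learn (stand-in dataset): tune on training folds only, then touch the held-out test set exactly once.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # All hyperparameter tuning happens inside cross-validation on the
    # training split, so the test set never leaks into model selection.
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
    search.fit(X_tr, y_tr)

    # One evaluation on held-out data: an honest estimate of generalization error.
    print(search.score(X_te, y_te))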

-----

1 point by Malarkey 3780 days ago | link

I agree with every word you say. Now that I have a moment (thx Santa), I will just add that most scientists (and many data scientists) are not primarily, or at least not only, interested in Kaggle-style best prediction performance. They fit models because they are interested in understanding the system and the features of the model.

So if you have a linear model that fits some data with 90% accuracy, this can be more useful than a neural network that predicts at 92%, because the GLM gives you coefficients, and maybe regularisation or feature selection: something you can interpret or describe to others.
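
To make that concrete, a minimal sketch with scikit-learn (the bundled dataset is just a stand-in; the L1 penalty doubles as feature selection):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression

    data = load_breast_cancer()

    # L1-penalised logistic regression: a GLM whose sparse coefficients
    # are readable, unlike the weights of a comparable neural network.
    model = LogisticRegression(penalty="l1", solver="liblinear")
    model.fit(data.data, data.target)

    # Each surviving coefficient is a statement about a named feature
    # that you can describe to others.
    for name, coef in zip(data.feature_names, model.coef_[0]):
        if coef != 0.0:
            print(f"{name}: {coef:+.3f}")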

Further, whilst there have been great strides in deep learning, SVMs, and ensemble methods, it is often the case that complex models are fit very particularly to the dataset. Not overfit in the sense that they perform poorly on the set-aside test set, but rather that they may perform poorly on another dataset obtained by different people, at a different point in time, with different equipment, or from a different population. This is a well-known phenomenon in science/medicine, and it makes people wary of pushing for the absolute maximum best-tuned Kaggle-style predictor.

Now, I'm not knocking deep learning or similar (as an ex-PhD in visual neuroscience I find it thrilling). These methods may well revolutionise speech recognition, self-driving cars, image identification, etc. Yet such systems are a particular type of AI where you just want the system to, say, recognise a puppy or avoid a lamppost, not output a general law of puppies or lampposts. Indeed, much of their power seems to arise from their abstracting what puppies or lampposts are. These systems are amazing.

But you shouldn't confuse them with the other class of data science, or just science, which is about summarising (not abstracting) a complex system in a way you can understand.

..so, tl;dr: black-box methods are fascinating and great for a whole class of problems, but for many data science or plain science applications they aren't necessary, may overfit, require a lot of tuning effort, or aren't very informative (in the colloquial sense of the word).

-----



