DataTau | The Real Difference Between Machine Learning and Statistics

	The Real Difference Between Machine Learning and Statistics (glv.nz)
	13 points by marksaldana 3164 days ago | 6 comments

6 points by debrouwere 3164 days ago | link

Definitely a topic that needs to be talked about more, but the comparison between ML and statistics in this post is kind of sloppy.

* statistics uses graphs all the time (causal inference, Bayesian MCMC), machine learning uses models all the time (regression is a staple of both ML and statistics) and in fact most classifiers are really just different ways to model P(class|data)

* ML has no notion of uncertainty around parameters, but of course it has a notion of predictive uncertainty – RMSE, confusion matrices – how else would you rank a classifier or a regressor?

* variable importance scores can be computed for most ML algorithms; this is not the same thing as a parameter estimate but it does show similar concerns; decision trees also hail from ML and they are wonderfully interpretable

* frequentist statistics abhors the idea of prior distributions; on the other hand machine learning techniques often have them built in (e.g. shrinkage methods and regularization are sometimes equivalent to a Laplace prior or somesuch)

* statistics has a lot of cookie-cutter models that you can stick on pretty much anything, fair enough, but if a statistician ever just assumes a model a priori and then refuses to modify it when the fit is bad, fire that statistician; statistical models are absolutely not a priori
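
To make the point about shrinkage and priors concrete, here is a minimal sketch of the simplest case I can think of: estimating a single mean. It assumes unit-variance Gaussian noise and a Laplace prior with scale 1/lam, in which case the L1-penalised least-squares objective and the negative log posterior are the same function up to a constant. The data and the names (`lasso_mean`, `map_by_grid`) are made up for illustration.

```python
import math

def lasso_mean(ys, lam):
    """Closed-form minimiser of 0.5 * sum((y - b)^2) + lam * |b| (soft-thresholding)."""
    n = len(ys)
    mean = sum(ys) / n
    shrink = lam / n
    return math.copysign(max(abs(mean) - shrink, 0.0), mean)

def neg_log_posterior(b, ys, lam):
    """Negative log posterior, up to a constant.

    Assumes y ~ Normal(b, 1) and b ~ Laplace(0, scale=1/lam).
    """
    neg_log_likelihood = 0.5 * sum((y - b) ** 2 for y in ys)
    neg_log_prior = lam * abs(b)
    return neg_log_likelihood + neg_log_prior

def map_by_grid(ys, lam, lo=-5.0, hi=5.0, steps=100001):
    """Brute-force MAP estimate on an evenly spaced grid."""
    grid = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return min(grid, key=lambda b: neg_log_posterior(b, ys, lam))

ys = [0.9, 1.4, 0.2, 1.1, 0.6]          # hypothetical observations
penalised = lasso_mean(ys, lam=2.0)     # the "ML" view: penalised fit
map_estimate = map_by_grid(ys, lam=2.0) # the Bayesian view: MAP under a Laplace prior
print(penalised, map_estimate)
```

The two estimates agree up to the grid resolution, and a large enough penalty shrinks the estimate all the way to zero. Swapping the absolute value for a squared penalty would give the ridge/Gaussian-prior pairing instead.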

The most important difference to me is that statistics focuses on interpretation and generalizability -- not just generalizability to the test data, but generalizability to different people in different countries at different times -- whereas ML focuses on predictive performance on data for which the underlying distribution is not expected to change.

To take a classic toy example: when a statistician investigates whether nicotine stains cause lung cancer in an observational study, they will want to control for smoking as well as various demographic factors. The conclusion will be that nicotine stains do not cause cancer, because it's really the smoking that's doing that. A machine learning approach might instead be to try to find the people most at risk for lung cancer, and if nicotine stains are a great and cheap-to-measure predictor for that purpose, then that's wonderful, let's use it! (I'm not being snarky here, who cares if good predictions come from weird predictors as long as your population is stable? As long as confounders don't vary over time, you're good.)
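
A small simulation shows both halves of the toy example. This is only a sketch: every probability below is invented, and cancer is generated from smoking alone, so stains have no causal effect by construction.

```python
import random

def simulate(n, seed=0):
    """Draw (smoker, stains, cancer) triples; cancer depends on smoking only."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        smoker = rng.random() < 0.30                       # made-up base rate
        stains = rng.random() < (0.80 if smoker else 0.05) # stains mostly come from smoking
        cancer = rng.random() < (0.20 if smoker else 0.02) # stains play no role here
        rows.append((smoker, stains, cancer))
    return rows

def cancer_rate(rows):
    return sum(cancer for _, _, cancer in rows) / len(rows)

def risk_difference(rows):
    """Cancer rate among people with stains minus the rate among people without."""
    with_stains = [r for r in rows if r[1]]
    without_stains = [r for r in rows if not r[1]]
    return cancer_rate(with_stains) - cancer_rate(without_stains)

rows = simulate(100_000)

# The "ML" view: ignore smoking, stains look like a strong predictor.
marginal = risk_difference(rows)

# The "statistics" view: control for smoking by comparing within each stratum.
among_smokers = risk_difference([r for r in rows if r[0]])
among_nonsmokers = risk_difference([r for r in rows if not r[0]])

print(f"marginal: {marginal:.3f}")
print(f"smokers: {among_smokers:.3f}, non-smokers: {among_nonsmokers:.3f}")
```

Ignoring smoking, stains separate high-risk from low-risk people nicely. Within each smoking stratum the difference is close to zero, which is the statistician's conclusion. The predictor only stays useful while the link between stains and smoking holds.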

This makes statistical methods often a better choice when you're interested in control (let's change X to change Y), and machine learning better for accurate predictions.

If I were working for a software as a service company and trying to reduce customer churn, I could imagine using both statistical linear models and ML techniques. I would want the best predictions for who is likely to cancel their subscriptions -- and "best" in this context is itself an interesting problem because I would tolerate false positives more than false negatives -- but in trying to keep these "at-risk" customers on board, I would probably rely on statistical knowledge about what factors make people likely to leave or stay.
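
The "best in this context" part can be sketched as a decision threshold. Assuming the model's churn probabilities are reasonably well calibrated, flagging a customer is worth it when p * cost_fn exceeds (1 - p) * cost_fp, which gives a cutoff of cost_fp / (cost_fp + cost_fn). The costs, scores and labels below are hypothetical.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Probability cutoff that minimises expected cost, assuming calibrated scores."""
    return cost_fp / (cost_fp + cost_fn)

def confusion_counts(probs, labels, threshold):
    """Return (tp, fp, fn, tn) when flagging every customer with p >= threshold."""
    tp = fp = fn = tn = 0
    for p, churned in zip(probs, labels):
        flagged = p >= threshold
        if flagged and churned:
            tp += 1
        elif flagged and not churned:
            fp += 1
        elif not flagged and churned:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def total_cost(counts, cost_fp, cost_fn):
    _, fp, fn, _ = counts
    return fp * cost_fp + fn * cost_fn

# Hypothetical costs: a wasted retention offer (1) versus a lost customer (5).
COST_FP, COST_FN = 1.0, 5.0

# Hypothetical predicted churn probabilities and what actually happened.
probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

threshold = cost_optimal_threshold(COST_FP, COST_FN)
tuned = confusion_counts(probs, labels, threshold)
default = confusion_counts(probs, labels, 0.5)

print(threshold, tuned, total_cost(tuned, COST_FP, COST_FN))
print(0.5, default, total_cost(default, COST_FP, COST_FN))
```

On this toy data the lower cutoff accepts one extra false positive to catch a churner the 0.5 cutoff misses, and the total cost drops. None of this says why customers leave, which is where the linear models come in.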

Another big difference is that statisticians have developed most of their tools for scientific experiments (medicine, psychology), observational studies (social sciences, econometrics) and industrial process control (things like outlier detection.) These are all very different but they are united in that the data is just numbers, numbers, numbers. The great insight of the AI and ML communities was that you can treat pretty much anything and everything like a probabilistic problem, as long as you do enough feature engineering. Face recognition, natural language analysis, driving an autonomous car, you name it. So ML has made statistics and probability exciting again.

That said, I do sometimes miss the skeptical attitude, typical of the statistician, among ML practitioners. ML is more opportunistic, "let's take advantage of the signal before it turns into noise." Statistics is more cautious and skeptical, "if I treat this as signal, it's going to blow up in my face later."

-----

4 points by kiyoto 3164 days ago | link

I also felt that the comparison generalized too broadly. The "statistician" depicted in the OP seems more frequentist than Bayesian.

>Statistics is more cautious and skeptical, "if I treat this as signal, it's going to blow up in my face later."

This is especially true about frequentist statistics. To quote the witty statistics giant Efron, "Frequentists are Bayesians that are trying to do not too badly."

-----

1 point by jcbozonier 3162 days ago | link

"ML has no notion of uncertainty around parameters"

You're being pretty specific about what ML is. I don't see a difference between Bayesian Inference and ML. It's OK to have overlap between the two fields. Naive Bayes is a great example of this.

The article might be better titled "Why a Mathematician, Statistician, and Machine Learner Might Solve the Same Problem Differently."

All of these fields are just tools in a toolbox. Defining yourself as someone who only uses a single tool severely limits your options.

-----

1 point by debrouwere 3161 days ago | link

Well, sure, but if you're going to take that tack, then you might as well say "it's all just statistical learning anyway." Which is true: the main reason ML and statistics are different things is that they've historically grown from different scientific disciplines and have yet to fully merge, not that they're inherently different or that you have to choose just one. But that's not very informative when trying to explain the real differences in attitude and approach between these different communities.

Bayesian inference is actually a case in point. If you read any of the work on MCMC and probabilistic programming, machine learning hardly ever gets mentioned because the scholars pushing MCMC don't identify with that community. Why? No reason why, that's just how it is. The only time I have seen MCMC explicitly mentioned as an ML technique is in http://www.mbmlbook.com/, where it's part of a conscious attempt by the author to win over people used to more mainstream ML techniques like random forests and SVMs.

-----

1 point by mikeskim 3161 days ago | link

most machine learning algorithms (rf,gbm,glmnet,svm) were written by statisticians.

-----

1 point by debrouwere 3160 days ago | link

Calling Vladimir Vapnik (the inventor of SVMs) a statistician simply because he studied statistics is somewhat disingenuous – he's been employed as a professor in computer science for pretty much all his life. Leo Breiman, who invented random forests, probably still identifies as a statistician, but also wrote the influential "Statistical Modeling: The Two Cultures" where he makes a pretty severe break with traditional statistics. So you're being a bit charitable here :-)

-----
