Machine Learning vs. Statistics

The Texas Death Match of Data Science  |  August 10th, 2017

Throughout its history, Machine Learning (ML) has coexisted uneasily with Statistics, like an ex-boyfriend accidentally seated with the groom’s family at a wedding reception: both uncertain where to lead the conversation, but painfully aware of the potential for awkwardness. The tension arises in part because Machine Learning has adopted many of Statistics’ methods, yet was never intended to replace Statistics, or even to have a statistical basis originally. Nevertheless, Statisticians and ML practitioners have often ended up working together, or on similar tasks, wondering what the other was about. The question “What’s the difference between Machine Learning and Statistics?” has now been asked for decades.

Machine Learning is largely a hybrid field, taking its inspiration and techniques from all manner of sources. It has changed direction throughout its history and has often seemed like an enigma to those outside of it.1 Since Statistics is better understood as a field, and ML seems to overlap with it, the question of the relationship between the two arises frequently. Many answers have been given, ranging from the neutral or dismissive:

  • “Machine learning is essentially a form of applied statistics”
  • “Machine learning is glorified statistics”
  • “Machine learning is statistics scaled up to big data”
  • “The short answer is that there is no difference”

to the questionable or disparaging:

  • “In Statistics the loss function is pre-defined and wired to the type of method you are running. In machine learning, you will most likely write a custom program for a unique loss function specific to your problem.”
  • “Machine learning is for Computer Science majors who couldn’t pass a Statistics course.”
  • “Machine learning is Statistics minus any checking of models and assumptions.”
  • “I don’t know what Machine Learning will look like in ten years, but whatever it is I’m sure Statisticians will be whining that they did it earlier and better.”

The question has been asked, and continues to be asked regularly, on Quora, StackExchange, LinkedIn, KDnuggets, and other social sites. Worse, there are disputes over which field “owns” which techniques [“Is logistic regression a statistical technique or a machine learning one? What if it’s implemented in Spark?”, “Is Regression Analysis Really Machine Learning?” (Mayo, see References)]. We have seen many answers that we regard as misguided, irrelevant, confusing, or simply wrong.

We (Tom, a Machine Learning practitioner, and Drew, a professional Statistician) have worked together for several years, observing each other’s approaches to analysis and problem solving on data-intensive projects. We have spent hours trying to understand each other’s thought processes and discussing the differences. We believe we have an understanding of the role each field plays within data science, which we attempt to articulate here.

The difference, as we see it, is not one of algorithms or practices but of goals and strategies. Neither field is a subset of the other, and neither lays exclusive claim to a technique. They are like two pairs of old men sitting in a park playing two different board games. Both games use the same type of board and the same set of pieces, but each plays by different rules and has a different goal because the games are fundamentally different. Each pair looks at the other’s board with bemusement and thinks they’re not very good at the game.

The purpose of this blog post is to explain the two games being played.

Statistics

Both Statistics and Machine Learning create models from data, but for different purposes. Statisticians focus heavily on a special type of metric called a statistic: a summary value that provides a form of data reduction, converting raw data into a small number of informative quantities. Two common examples are the mean and standard deviation. Statisticians use such statistics for several different purposes. One common way of dividing the field is into the areas of descriptive and inferential statistics.
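Before turning to those two areas, here is a minimal sketch of data reduction itself, using simulated data rather than anything from this post: ten thousand raw values collapse into two numbers.

```python
# A minimal sketch of statistics as data reduction (simulated data):
# ten thousand raw measurements summarized by two statistics.
import numpy as np

rng = np.random.default_rng(0)
measurements = rng.normal(loc=100.0, scale=15.0, size=10_000)  # raw data

print("mean:", measurements.mean())        # central tendency
print("std: ", measurements.std(ddof=1))   # spread
```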

Descriptive statistics deals with describing the structure of the raw data, generally through visualizations and statistics. These descriptive statistics provide a much simpler way of understanding what can be very complex data. As an example, consider the thousands of companies listed on the various stock exchanges. It can be very difficult to look at the barrage of numbers and understand what is happening in the market. For this reason, you will see commentators talk about how a specific index is up or down, or what percentage of companies gained or lost value during the day.

Inferential statistics deals with making statements about data. Though some of the original work dates back to the 18th and 19th centuries, the field really came into its own with the pioneering work of Karl Pearson, R. A. Fisher, and others at the turn of the 20th century. Inferential statistics tries to address questions like:

  • Do people in tornado shelters have a higher survival rate than people who hide under bridges?
  • Given a sample of the whole population, what is the estimated size of the population?
  • In a given year, how many people are likely to need medical treatment in the city of Bentonville?
  • How much money should you have in your bank account to be able to cover your monthly expenses 99 out of 100 times?
  • How many people will show up at the local grocery store tomorrow?

The questions deal with both estimation and prediction. If we had complete, perfect information, it might be possible to calculate these values exactly. But in the real world there is always uncertainty, which means that any claim you make has a chance of being wrong, and for some types of claims it is almost certain you will be slightly wrong. For example, if you are asked to estimate the exact temperature outside your house and you estimate the value as 29.921730971, it is pretty unlikely that you are exactly correct. And even if you turn out to get it right on the nose, ten seconds later the temperature is likely to be somewhat different.

Inferential statistics tries to deal with this problem. Even in the absolute best case, the claims made by a statistician will be wrong at least some portion of the time. And unfortunately, given the same data, it is impossible to decrease the rate of false positives without increasing the rate of false negatives. The more evidence you demand before claiming that a change is happening, the more likely it is that real changes will fail to meet your standard of evidence.
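To make this concrete, here is a small sketch assuming a one-sided z-test for a shift in a Normal mean; the effect size and sample size are arbitrary illustrations, not values from this post. As the false-positive rate α is tightened, the fraction of real changes that go undetected grows.

```python
# False-positive vs. false-negative trade-off for a one-sided z-test.
# The true effect and sample size are hypothetical, chosen for illustration.
from scipy.stats import norm

effect, n = 0.3, 50  # true mean shift (in SD units) and sample size

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)                     # evidence threshold
    power = 1 - norm.cdf(z_crit - effect * n**0.5)   # P(detect the change)
    print(f"alpha={alpha:.2f}  power={power:.2f}  missed={1 - power:.2f}")
```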

Since decisions still have to be made, statistics provides a framework for making better decisions. To do this, statisticians need to be able to assess the probabilities associated with various outcomes, and to do that, they use models. In statistics, modeling aims to approximate and then understand the data-generating process, ultimately in order to answer the question you actually care about.

The models provide the mathematical framework needed to make estimations and predictions. In practice, a statistician has to make trade-offs between models with strong assumptions and models with weak assumptions. Strong assumptions generally mean you can reduce the variance of your estimator (a good thing) at the cost of risking more model bias (a bad thing), and vice versa. The problem is that the statistician has to decide which approach to use without knowing for certain which is best.
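This trade-off has a standard formal expression. Using conventional notation (not anything specific to this post), the mean squared error of an estimator θ̂ of a parameter θ decomposes into bias and variance terms:

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\left[(\hat{\theta} - \theta)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{\theta}] - \theta\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{\theta} - \mathbb{E}[\hat{\theta}]\right)^2\right]}_{\text{variance}}
```

Strong assumptions shrink the variance term; if the assumptions are wrong, they inflate the bias term.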

Since statisticians are required to draw formal conclusions, the goal is to prepare every statistical analysis as if you were going to be an expert witness at a trial.

This is an aspirational goal: in practice, statisticians often perform simple analyses that are not intended to stand up in a court of law. But the basic idea is sound. A statistician should perform an analysis with the expectation that it will be challenged, so each choice made in the analysis must be defensible.

It is important to understand the implications of this. The analysis is the final product. Ideally, every step should be documented and supported, including the data cleaning steps and the human observations that led to the model selection. Each model assumption should be listed and checked, and every diagnostic test run and its results reported. The statistician’s analysis, in effect, guarantees that the model is an appropriate fit for the data under a specified set of conditions.

In conclusion, the Statistician is concerned primarily with model validity, accurate estimation of model parameters, and inference from the model. However, prediction of unseen data points, a major concern of Machine Learning, is of less interest to the statistician. Statisticians have the techniques to do prediction, but these are just special cases of inference in general.

Machine Learning

Machine Learning has had many twists and turns in its history. Originally it was part of AI and closely aligned with it, concerned with all the ways in which intelligent human behavior could be learned. In the last few decades, as with much of AI, it has shifted toward an engineering/performance approach, in which the goal is to achieve a fairly specific task with high performance. In Machine Learning, the predominant task is predictive modeling: the creation of models for the purpose of predicting labels of new examples. We put aside other concerns of Machine Learning for the moment, since predictive modeling is the dominant sub-field and the one with which Statistics is so often compared.

We briefly define the process in order to be clear. In predictive analytics, the ML algorithm is given a set of historical labeled examples. Each example has a label, which, depending on the problem type, can be either the name of a class (classification) or a numeric value (regression). From these examples the algorithm creates a model whose purpose is prediction. Specifically, the learning algorithm analyzes the data examples and creates a procedure that, given a new unseen example, can accurately predict its label. Some portion of the data is set aside (the holdout set) and used to validate the model; alternatively, a method like the bootstrap or cross-validation can be employed to reuse the data in a principled way.
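As a minimal sketch of this process, assuming scikit-learn and a synthetic dataset standing in for the historical labeled examples (the post itself names no library or data):

```python
# Train a classifier on labeled examples, then validate on a holdout set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "historical labeled examples".
X, y = make_classification(n_samples=1000, random_state=0)

# Set a portion of the data aside as the holdout set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The model is judged by its predictions on unseen examples.
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```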

Used this way, predictive modeling can have great value. A model with good performance characteristics can predict which customers are valuable, which transactions are fraudulent, which customers are good loan risks, when a device is about to fail, whether a patient has cancer, and so on. All of this assumes that the future will be similar to the past: that historical patterns which occurred frequently enough will occur again. This presumes some degree of causality, of course, and such causal assumptions must be validated.

In contrast to Statistics, note that the goal here is to generate the best prediction. The ML practitioner usually does some exploratory data analysis, but only to prepare the data and to guide the choice of features and a model family. The model does not represent a belief about or a commitment to the data generation process. Its purpose is purely functional. No ML practitioner would be prepared to testify to the “validity” of a model; this has no meaning in Machine Learning, since the model is really only instrumental to its performance.2 The motto of Machine Learning may as well be: The proof of the model is in the test set.

This approach has a number of important implications that distance ML from Statistics.

  1. ML practitioners are freed from worrying about model assumptions or diagnostics. Model assumptions are only a problem if they cause bad predictions. Of course, practitioners often perform standard exploratory data analysis (EDA) to guide selection of a model type. But since test set performance is the ultimate arbiter of model quality, the practitioner can usually relegate assumption testing to model evaluation.
  2. Perhaps more importantly, ML practitioners are freed from worrying about difficult cases where assumptions are violated, yet the model may work anyway. Such cases are not uncommon. For example, the theory behind the Naive Bayes classifier assumes attribute independence, but in practice it performs well in many domains containing dependent attributes (DomingosPazzani, see References); a small demonstration appears after this list. Similarly, Logistic Regression assumes non-collinear predictors, yet often tolerates collinearity. Techniques that assume Gaussian distributions often work when the distribution is only Gaussian-ish.
  3. Unlike the Statistician, the ML practitioner assumes the samples are drawn independently and identically distributed (IID) from a static population, and are representative of that population. If the population changes such that the sample is no longer representative, all bets are off. In other words, the test set is assumed to be a random sample from the population of interest. If the population is subject to change (called concept drift in ML), some techniques can be brought into play to test and adjust for this, but by default the ML practitioner is not responsible if the sample becomes unrepresentative.
  4. Very often, the goal of predictive analytics is ultimately to deploy the prediction method so the decision is automated. It becomes part of a pipeline in which it consumes some data and emits decisions. Thus the data scientist has to keep in mind pragmatic computational concerns: How will this be implemented? How fast does it have to be? Where does the model get its data, and what does it do with the final decision? Such computational concerns are usually foreign to Statisticians.
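As promised in point 2, here is a minimal sketch of Naive Bayes tolerating a gross violation of its independence assumption. This is our own illustration on synthetic data (using scikit-learn’s GaussianNB), not the experiments from Domingos & Pazzani.

```python
# Naive Bayes assumes independent features; we deliberately violate that
# assumption by appending exact copies of an existing feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X = np.hstack([X, X[:, :1], X[:, :1]])  # perfectly dependent duplicates

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)
print("holdout accuracy:", accuracy_score(y_te, model.predict(X_te)))
```

Despite the duplicated, perfectly dependent features, the holdout accuracy typically remains respectable, which is exactly the point: the assumption is violated, but the predictions are judged on their own merits.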

To a Statistician, Machine Learning may look like an engineering discipline [Note from Drew: You bet it does! But that is not a bad thing], rather than science—and to an extent this is true. Because ML practitioners do not have to justify model choice or test assumptions, they are free to choose from among a much larger set of models. In essence, all ML techniques employ a single diagnostic test: the prediction performance on a holdout set. And because Machine Learning often deals with large data sets, the ML practitioner can choose non-parametric models that typically require a great deal more data than parametric models.

As a typical example, consider random forests and boosted decision trees. The theory of how these work is well known and understood. Both are non-parametric techniques that require a relatively large number of examples to fit. Neither comes with diagnostic tests or assumptions about when it can and cannot be used. Both are “black box” models that produce nearly unintelligible classifiers. For these reasons, a Statistician would be reluctant to choose them. Yet they are surprisingly, almost amazingly, successful at prediction problems. They have scored highly in many Kaggle competitions, and are standard go-to models for participants to use.
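A minimal sketch of that workflow for the two methods named above, on synthetic data with default hyperparameters (our illustration, not a benchmark):

```python
# Fit both ensemble methods and judge each by holdout performance,
# the single diagnostic test of Machine Learning.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{type(model).__name__}: holdout accuracy = {acc:.3f}")
```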

Conclusion

There are large areas of Statistics and Machine Learning we have said nothing about: clustering, association rules, feature selection, evaluation methodologies, and so on. The two fields don’t always see eye to eye on these, but we are aware of little confusion about their fundamental use. We concentrate here on predictive modeling, which seems to be the main point of friction between the fields.

In summary, both Statistics and Machine Learning contribute to Data Science but they have different goals and make different contributions. Though the methods and reasoning may overlap, the purposes rarely do. Calling Machine Learning “applied Statistics” is misleading, and does a disservice to both fields.

Much has been made of these differences. Machine Learning is generally taught as part of the computer science curriculum, while Statistics is taught either by a dedicated department or as part of the math department. Computer scientists are taught to design real-world algorithms that will be used as part of software packages, while statisticians are trained to provide the mathematical foundation for scientific research. In many cases, the two fields use different terminology when referring to exactly the same thing.3 Putting the two groups together on a common data science team (often alongside individuals trained in other scientific fields) can create a very interesting team dynamic.

However, the two approaches share important similarities. Fundamentally, both ML and Statistics work with data to solve problems. In many of the dialogues we have had over the past few years, it is obvious that we are thinking about many of the same basic issues. Machine Learning may emphasize prediction, and Statistics may focus more on estimation and inference, but both use mathematical techniques to answer questions from data. Perhaps more importantly, the ongoing dialogue between the two can bring improvements to both fields. For example, topics such as regularization and resampling are relevant to both kinds of problems, and both fields have contributed to advances in them.

 

NOTE: We were made aware, after writing this blog post, that some of our points are made in Leo Breiman’s 2001 journal article “Statistical Modeling: The Two Cultures” (Breiman, see References). We don’t claim to present or summarize his views; we wanted our post to be a fairly short, quick read presenting our own. Since Breiman’s article is more elaborate than this essay, and his work is always worth reading, we refer the reader to it.

 

References

  • Breiman: Breiman, L. Statistical Modeling: The Two Cultures. Statistical Science (2001) 16:3, 199–231.
  • DomingosPazzani: Domingos, P. & Pazzani, M. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning (1997) 29, 103–130. doi:10.1023/A:1007413511361
  • Freitas: Freitas, A. Comprehensible Classification Models: a position paper. ACM SIGKDD Explorations Newsletter (2013) 15:1, 1–10.
  • Mayo: Mayo, M. Is Regression Analysis Really a Machine Learning Tool? KDnuggets, June 17, 2017.
  • Shmueli: Shmueli, G. To Explain or to Predict? Statistical Science (2010) 25:3, 289–310. doi:10.1214/10-STS330

 

1. Where Machine Learning has been and the path it has taken make for an interesting story, but one that is longer than this blog posting. Suffice it to say that Machine Learning is a lot like a war orphan: it has a sketchy lineage, and it has been through a lot and seen a lot, not all of which it wants to remember.
2. This is not strictly true. We are exaggerating to make a point. Some ML practitioners care about model intelligibility. Both Freitas and Shmueli (see References) have written about the importance of intelligible data models and descriptive data analysis. But these papers simply reinforce the original point: the community must be reminded that intelligibility is desirable because it is so often forgotten.
3. Robert Tibshirani’s glossary [PDF] provides some guidance here.