Translating Between Statistics and Machine Learning

Statistics and machine learning often use different terminology for similar concepts. I recently confronted this when I began reading about maximum causal entropy as part of a project on inverse reinforcement learning. Many of the terms were unfamiliar to me, but as I read more closely, I realized that the concepts were closely related to statistical concepts. This blog post presents a table of connections between terms that are standard in statistics and their related counterparts in machine learning.

Understanding a result in machine learning can help to avoid reinventing the wheel in statistics and vice versa. My ability to understand inverse reinforcement learning benefited from my training in statistics because I was able to translate machine learning terminology into statistical terminology. Translation takes effort, however, and my research would have proceeded even more smoothly if translation were not required. This experience motivated me to compile a table of common statistics and machine learning terms and connections between them.

Each row of the table links a set of terms that are standard in statistics literature (first column) with related concepts in machine learning (second column). Notes in the third column explain the relationship. Several disclaimers apply:

The correspondences are not precise equivalences; appreciating them requires a certain amount of poetic interpretation.
I've drawn on previous efforts to translate between statistics and machine learning, such as Larry Wasserman's blog and comments in the wild.
My entries are interpretive and far from exhaustive. I may update the table from time to time, and I especially invite feedback in the comments section below.

Statistics	Machine learning	Notes
data point, record, row of data	example, instance	Both domains also use "observation," which can refer to a single measurement or an entire vector of attributes depending on context.
response variable, dependent variable	label, output	Both domains also use "target." Since practically all variables depend on other variables, the term "dependent variable" is potentially misleading.
variable, covariate, predictor, independent variable	feature, side information, input	The term "independent variable" exists for historical reasons but is usually misleading, since such a variable typically depends on other variables in the model.
regressions	supervised learners, machines	Both estimate output(s) in terms of input(s).
estimation	learning	Both translate data into quantitative claims, becoming more accurate as the supply of relevant data increases.
hypothesis ≠ classifier	hypothesis	In both statistics and ML, a hypothesis is a scientific statement to be scrutinized, such as "The true value of this parameter is zero." In ML (but not in statistics), a hypothesis can also refer to the prediction rule that is output by a classifier algorithm.
bias ≠ regression intercept	bias	Statistics distinguishes between (a) bias as a form of estimation error and (b) the default prediction of a linear model in the special case where all inputs are 0. ML sometimes uses "bias" to refer to both of these concepts, although the best ML researchers certainly understand the difference.
Maximize the likelihood to estimate model parameters	If your target distribution is discrete (such as in logistic regression), minimize the cross-entropy to derive the best parameters. If your target distribution is continuous, fine, just maximize the likelihood.	For discrete distributions, maximizing the likelihood is equivalent to minimizing the cross-entropy between the observed labels and the model's predicted probabilities.
Apply Occam's razor, or encode missing prior information with suitably uninformative priors.	Apply the principle of maximum entropy.	The principle of maximum entropy is conceptual and does not refer to maximizing a concrete objective function. The principle is that models should be conservative in the sense that they be no more confident in the predictions than is thoroughly justified by the data. In practice this works out as deriving an estimation procedure in terms of a bare-minimum set of criteria as exemplified here or here.
logistic/multinomial regression	maximum entropy, MaxEnt	They are equivalent except in special multinomial settings like ordinal logistic regression. Note that maximum entropy here refers to the principle of maximum entropy, not the form of the objective function. Indeed, in MaxEnt, you minimize a cross-entropy expression rather than maximize an entropy.
X causes Y if surgical (or randomized controlled) manipulations of X are correlated with changes in Y	X causes Y if it doesn't obviously not cause Y. For example, X causes Y if X precedes Y in time (or is at least contemporaneous)	The stats definition is more aligned with common-sense intuition than the ML one proposed here. In fairness, not all ML practitioners are so abusive of causation terminology, and some of the blame belongs with even earlier abuses such as Granger causality.
structural equations model	Bayesian network	These are nearly equivalent mathematically, although interpretations differ by use case, as discussed.
sequential experimental design	active learning, reinforcement learning, hyperparameter optimization	Although these four subfields are very different from each other in terms of their standard use cases, they all address problems of optimization via a sequence of queries/experiments.
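
The likelihood row of the table can be checked numerically. What follows is a minimal sketch, not a reference implementation: it assumes binary labels and two hypothetical sets of predicted probabilities (the data and the names `model_a` and `model_b` are invented for illustration). It shows that the average cross-entropy, often called log loss in ML, is just the negative log-likelihood divided by the number of examples, so whichever model maximizes one minimizes the other.

```python
import math


def bernoulli_log_likelihood(labels, probs):
    """Log-likelihood of binary labels given predicted P(y = 1) for each example."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(labels, probs)
    )


def mean_cross_entropy(labels, probs):
    """Average binary cross-entropy (log loss) between labels and predictions."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(labels, probs)
    )
    return -total / len(labels)


# Hypothetical toy data: observed labels and two candidate models' predictions.
labels = [1, 0, 1, 1, 0]
model_a = [0.9, 0.2, 0.7, 0.6, 0.1]
model_b = [0.6, 0.5, 0.5, 0.4, 0.4]

for name, probs in [("A", model_a), ("B", model_b)]:
    ll = bernoulli_log_likelihood(labels, probs)
    ce = mean_cross_entropy(labels, probs)
    # The two quantities differ only by sign and a constant factor of 1/n.
    assert math.isclose(ce, -ll / len(labels))
    print(f"model {name}: log-likelihood = {ll:.3f}, cross-entropy = {ce:.3f}")
```

Because the two objectives differ only by a sign and a constant factor, ranking models by likelihood (the statistics habit) and ranking them by cross-entropy (the ML habit) always gives the same ordering on a fixed data set.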

The bifurcation of machine learning and statistics terminology has its roots in a historical disconnect between the efforts of computer scientists and statisticians. The table in this 1983 paper shows that the resulting translation challenges have been around for a long time. I'm willing to bet that new opportunities for translation will arise for as long as the fields of statistics and machine learning continue to expand.

Additional Resources

Read the SEI Blog post "Machine Learning in Cybersecurity."

Software Engineering Institute

SEI Blog

Zachary Kurtz

November 19, 2018
