
The Myth of Machine Learning Reproducibility and Randomness for Acquisitions and Testing, Evaluation, Verification, and Validation

When the Wright Brothers began their experimentations with flight, they realized they were encountering a data reproducibility problem: the accepted equations to determine lift and drag only worked at one altitude. To solve this problem, they built a homemade wind tunnel, tested various wing types, and recorded performance data. Without the ability to reproduce experiments and identify incorrect data, flight may have been set back by decades.

A reproducibility challenge faces machine learning (ML) systems today. The testing, evaluation, verification, and validation (TEVV) of ML systems presents unique challenges that are often absent in traditional software systems. The introduction of randomness to improve training outcomes and the frequent lack of deterministic modes during development and testing often give the impression that models are difficult to test and produce inconsistent results. However, configurations that increase reproducibility are achievable within ML systems, and they should be made available to the engineering and TEVV communities. In this post, we explain why unpredictability is prevalent, how it can be addressed, and the pros and cons of addressing it. We conclude with why, despite the challenges of addressing unpredictability, it is important for our communities to expect predictable and reproducible modes for ML components, especially for TEVV.

ML Reproducibility Challenges

The nature of ML systems contributes to the challenge of reproducibility. ML components implement statistical models that provide predictions about some input, such as whether an image is a tank or a car. It is difficult to provide guarantees about these predictions, so guarantees about the resulting probability distributions are often given only in the limit, that is, as distributions across a growing sample. These outputs can also be described by calibration scores and statistical coverage, such as, “We expect the true value of the parameter to be in the range [0.81, 0.85] 95 percent of the time.” For example, imagine an ML model trained to classify civilian and military vehicles. When provided with an input image, the model will produce a set of scores, ideally calibrated, such as (0.90, 0.07, 0.03), meaning that similar images would be classified as a military vehicle 90 percent of the time, a civilian vehicle 7 percent of the time, and something else 3 percent of the time.
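As a rough illustration of where such scores come from (the raw outputs and class names below are invented for this example, not taken from any particular model), a softmax turns a model's raw outputs into scores that sum to one:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw model outputs (logits) into scores that sum to 1."""
    shifted = logits - logits.max()   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw outputs for the classes (military, civilian, other).
logits = np.array([3.0, 0.44, -0.40])
scores = softmax(logits)
print(dict(zip(["military", "civilian", "other"], scores.round(2))))
# approximately {'military': 0.9, 'civilian': 0.07, 'other': 0.03}
```

Whether those scores are actually calibrated—that is, whether a 0.90 really corresponds to being correct about 90 percent of the time—must be checked empirically on held-out data.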

Neural Networks and Training Challenges

At the center of the current discussion of reproducibility in machine learning are the mechanisms of neural networks. Neural networks are networks of nodes connected by weighted links. Each link has a value that determines how much the output of one node influences the output of the linked node, and thus nodes farther along the path to the final output. Collectively, these values are known as the network weights or parameters. Supervised training of a neural network involves passing in input data along with a corresponding ground-truth label that ideally will match the output of the trained network—that is, the label specifies how the trained network should classify the input data. Over many data samples, the network learns to map inputs to those labels through feedback mechanisms that adjust the network weights over the course of training.
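A minimal sketch of that supervised training loop, written in PyTorch terms (the layer sizes and data here are invented purely for illustration):

```python
import torch
from torch import nn

# A tiny network: weighted links (parameters) connecting inputs to outputs.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Hypothetical labeled data: 16 inputs with ground-truth labels (0 or 1).
inputs = torch.randn(16, 4)
labels = torch.randint(0, 2, (16,))

for _ in range(100):                        # repeatedly show the same data to the model
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)   # how far predictions are from the labels
    loss.backward()                         # feedback signal (gradients)
    optimizer.step()                        # adjust the network weights
```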

Training depends on many factors that can introduce randomness. For example, when we don’t have an initial set of weights from a pre-trained foundation model, research has shown that seeding an untrained network with randomly assigned weights works better for training than seeding with constant values. As the model learns, these random weights—initially the equivalent of noise—are adjusted so that predictions improve from essentially random guesses toward values closer to the ground truth. Additionally, the training process can involve repeatedly providing the same training data to the model, because conventional models learn only gradually. Some research shows that models may learn better and become more robust if the data are slightly modified or augmented and reordered each time they are passed in for training. These augmentations are also more effective when the modifications are small and random rather than systematic (e.g., images rotated by exactly 10 degrees each time or cropped in successively smaller sizes). Thus, to provide these data in a non-systematic way, a randomizer is used to produce a robust set of randomly modified images for training.
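The sketch below shows where that randomness typically enters, again using PyTorch-style calls; the specific transforms and sizes are illustrative assumptions, not prescriptions:

```python
import torch
from torch import nn
from torchvision import transforms

# 1. Random weight initialization: each run starts from different noise
#    unless the PRNG seed is pinned.
layer = nn.Linear(in_features=128, out_features=10)   # weights drawn from a PRNG

# 2. Random data augmentation: each epoch sees slightly different images
#    rather than systematically modified ones.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),      # random, not "exactly 10 degrees every time"
    transforms.RandomResizedCrop(size=224),     # random crop, not successively smaller crops
    transforms.RandomHorizontalFlip(p=0.5),
])

# 3. Random reordering: the loader reshuffles the training data every epoch.
# loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
```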

Though we often refer to these processes and techniques as being random, they are not. Many basic computer components are deterministic, though determinism can be compromised by concurrent and distributed execution. Many algorithms depend on having a source of random numbers to be efficient, including the training process described above. A key challenge is finding a source of randomness. In this regard, we distinguish true random numbers, which require access to a physical source of entropy, from pseudorandom numbers, which are created algorithmically. True randomness is abundant in nature but difficult to access from an algorithm on modern computers, so we generally rely on pseudorandom number generators (PRNGs). A PRNG takes “one or more inputs called ‘seeds,’ and it outputs a sequence of values that appears to be random according to specified statistical tests,” but that sequence is actually deterministic with respect to the particular seed.
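Python’s built-in PRNG makes the point concrete: the output passes for random, yet reusing the same seed reproduces the sequence exactly.

```python
import random

def sample(seed: int, n: int = 5) -> list[float]:
    """Draw n pseudorandom values from a generator seeded deterministically."""
    rng = random.Random(seed)
    return [round(rng.random(), 4) for _ in range(n)]

print(sample(seed=42))   # looks random...
print(sample(seed=42))   # ...but is identical, because the seed is the same
print(sample(seed=43))   # a different seed yields a different sequence
```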

These factors lead to two consequences for reproducibility:

  1. When training ML models, we use PRNGs to intentionally introduce randomness during training to improve the models.
  2. When we train on many distributed systems to increase performance, we do not force an ordering of results, because doing so generally requires synchronizing processes, which inhibits performance. The result is a process that started out fully deterministic and reproducible but now appears random and non-deterministic: pseudorandom numbers are injected intentionally, and additional variability arises from the unpredictable ordering of results across the distributed implementation (see the sketch after this list).
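The ordering effect in the second consequence is easy to demonstrate even without a cluster: floating-point addition is not associative, so accumulating the same partial results in a different order—as can happen when distributed workers finish at unpredictable times—may yield a different total. A small, self-contained sketch:

```python
import random

# The same partial results, accumulated in different orders, need not
# produce bit-identical sums, because floating-point addition is not associative.
values = [1e16, 1.0, -1e16, 1.0] * 1000

ordered_sum = sum(values)

shuffled = values[:]
random.Random(0).shuffle(shuffled)   # stand-in for unpredictable completion order
shuffled_sum = sum(shuffled)

print(ordered_sum, shuffled_sum, ordered_sum == shuffled_sum)
```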

Implications for TEVV

These factors create unique challenges for TEVV, and here we explore methods to mitigate the resulting difficulties. During development and debugging, we generally start from reproducible, known tests and introduce changes one at a time until we discover which change created the new effect. Thus, developers and testers both benefit greatly from well-understood configurations that provide reference points for many purposes. When there is intentional randomness in training and testing, this repeatability can be obtained by controlling random seeds to achieve deterministic, reproducible results, as illustrated below.
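In practice, such a reference point often takes the form of a regression test that pins the seed and checks that repeated runs agree exactly. The stand-in training function below is hypothetical; a real test would call the actual pipeline:

```python
import random

def train_tiny_model(seed: int) -> float:
    """Hypothetical stand-in for a training run that depends on a PRNG."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))

def test_training_is_reproducible_with_fixed_seed():
    # With the seed pinned, two runs must agree exactly, giving developers
    # and testers a stable reference point when debugging later changes.
    assert train_tiny_model(seed=1234) == train_tiny_model(seed=1234)
```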

Many organizations providing ML capabilities are still in technology-maturation or startup mode. For example, recent research has documented a variety of cultural and organizational challenges in adopting modern safety practices, such as system-theoretic process analysis (STPA) or failure mode and effects analysis (FMEA), for ML systems.

Controlling Reproducibility in TEVV

There are two basic techniques for managing reproducibility. First, we control the seeds for every randomizer used; in practice, there may be many. Second, we need a way to tell the system to serialize the training process when it executes across concurrent and distributed resources. Both approaches require the platform provider to include this sort of support. For example, PyTorch, a platform for machine learning, explains in its documentation how to set the various random seeds it uses, its deterministic modes, and their implications for performance. We suggest that, for development and TEVV purposes, any derivative platforms or tools built on these foundations should expose and encourage these settings for developers and implement their own controls for the features they provide.
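As a sketch of what exposing those settings can look like in code—based on the knobs PyTorch documents for reproducibility (the helper name is ours, and exact flags vary by version and hardware):

```python
import os
import random

import numpy as np
import torch

def enable_reproducible_mode(seed: int = 0) -> None:
    """Pin every PRNG in play and ask PyTorch for deterministic algorithms.

    Intended for development, debugging, and reference tests only; these
    settings can reduce performance and should not be left on in production.
    """
    random.seed(seed)                         # Python's built-in PRNG
    np.random.seed(seed)                      # NumPy PRNG (used by many data pipelines)
    torch.manual_seed(seed)                   # PyTorch PRNGs on CPU and GPU
    torch.use_deterministic_algorithms(True)  # error on known non-deterministic ops
    torch.backends.cudnn.benchmark = False    # disable non-deterministic autotuning
    # Some CUDA libraries also require an environment setting, per the PyTorch docs:
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```

Calling a helper like this before constructing models and data loaders lets two otherwise identical runs be compared bit for bit on the same hardware and software stack.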

It is important to note that this support for reproducibility does not come for free. A provider must expend effort to design, develop, and test this functionality as they would any other feature. Additionally, any platform built on these technologies must continue to expose these configuration settings and practices to the end user, which takes time and money. Juneberry, a framework for machine learning experimentation developed by the SEI, is an example of a platform that has invested that effort, exposing the configuration needed for reproducibility.

Despite the importance of these exact-reproducibility modes, they should not be enabled in production. Engineering and testing should use these configurations for setup, debugging, and reference tests, but not during final developmental or operational testing. Reproducibility modes can lead to non-optimal results (e.g., settling into poor minima during optimization), reduced performance, and possibly security vulnerabilities, because they allow external users to predict many of the system’s internal conditions. However, testing and evaluation can still be conducted on production configurations, and there are many statistical tests and heuristics available to assess whether the production system is working as intended. These production tests will need to account for run-to-run variability and should verify that the deterministic modes are not enabled during operational testing.
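One way to express such a production-side check, with hypothetical function names and thresholds: run the pipeline twice without pinned seeds, confirm the outputs are not bit-identical (which would suggest a reference mode was left enabled), and require the summary statistic to agree within a tolerance.

```python
import random
import statistics

def evaluate_once() -> list[float]:
    """Hypothetical stand-in for an evaluation run; real code would invoke the pipeline."""
    return [random.gauss(mu=0.85, sigma=0.01) for _ in range(100)]

def check_operational_configuration(tolerance: float = 0.02) -> None:
    run_a = evaluate_once()
    run_b = evaluate_once()

    # Deterministic/reference modes should be off in operational testing,
    # so two runs are expected to differ at least slightly.
    assert run_a != run_b, "bit-identical runs; a deterministic mode may still be enabled"

    # The aggregate behavior, however, should agree within an agreed tolerance.
    assert abs(statistics.mean(run_a) - statistics.mean(run_b)) <= tolerance

check_operational_configuration()
```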

Three Recommendations for Acquisition and TEVV

Considering these challenges, we offer three recommendations for the TEVV and acquisition communities:

  1. The acquisition community should require reproducibility and diagnostic modes. These requirements should be included in RFPs.
  2. The testing community should understand how to use these modes in support of final certification, including some testing with the modes disabled.
  3. Provider organizations should include reproducibility and diagnostic modes in their products. These objectives are readily achievable if they are required and designed into a system from the beginning. Without this support, engineering and test costs will increase significantly, potentially exceeding the cost of implementing these features, because defects not caught during development cost more to fix when discovered in later stages.

Reproducibility and determinism can be controlled during development and testing. Doing so requires early attention to design and engineering and a small increment in cost. Providers should have an incentive to offer these features, given the likely reduction in costs and risks during acceptance evaluation.
