
Ensuring Machine Learning Models Meet System and Mission Requirements

Created April 2025

As the Department of Defense (DoD) and other federal agencies seek to take advantage of the benefits gained by using machine learning (ML) systems, it’s increasingly necessary to test ML systems to ensure that they work as intended. Test and evaluation (T&E) of ML models is critical because of their data-dependent behavior, which can lead to failures when the operational environments and data assumed during model development do not match the environments actually encountered in operation.

While testing ML model properties, such as model performance, is common practice during model development, a system-centric testing methodology that considers system requirements and constraints did not exist until recently. In 2024, the Software Engineering Institute (SEI) released Machine Learning Test and Evaluation (MLTE, referred to as “melt”), a process and tool co-developed by the SEI and the Army AI Integration Center (AI2C) to evaluate ML models at each step in the development process and increase ML model production readiness.

Why So Many ML Models Fail in Production

Testing of ML models typically focuses on model properties alone, such as model performance (e.g., accuracy). In addition, this testing is generally not done in connection with the systems into which the models will be integrated. To ensure that an ML system meets mission and system requirements, comprehensive T&E must therefore address the requirements and constraints derived from that system.

While developing the MLTE tool and process, the team identified three common challenges in ML development and T&E processes.

MLTE seeks to address these challenges by offering a semi-automated, system-centric, quality-attribute-driven process and tool that enables negotiation, specification, and testing of ML model and system qualities.

“Testing of ML capabilities in practice is largely limited to model properties, such as model performance, without considerations of system requirements and constraints,” said Grace Lewis, who leads the MLTE work as a Principal Researcher in the SEI’s Software Solutions Division. “While in some cases more comprehensive testing is simply not a common data science practice, in most cases, model developers are not provided any system and mission context to inform development and testing activities. MLTE enables the elicitation and documentation of relevant system information and provides rigor and discipline to T&E for ML capabilities.”

The MLTE Process and Tool

MLTE allows teams to more effectively negotiate and document model requirements, evaluate model functionality, and share test results with all system stakeholders. Designed for interdisciplinary, cross-team coordination, the MLTE process facilitates communication by offering specific collaboration points and creating shared artifacts throughout the model development lifecycle. Detecting failures in ML capabilities early enables more responsive and timely delivery of ML-enabled capabilities to meet warfighter needs.

Using MLTE, teams begin the development process by eliciting and negotiating model requirements, a negotiation they continue through implementation. The MLTE tool provides several artifacts to support this negotiation, including (1) negotiation cards, which record the context and requirements that drive model development and testing, and (2) quality attribute (QA) scenarios, which are used to concretely define test cases that determine whether requirements have been met.
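
As an illustration of the kind of information a QA scenario captures, the sketch below expresses one as a plain Python data structure using the standard six-part scenario form (source, stimulus, environment, artifact, response, response measure). This is not the MLTE API; the class and the example values are hypothetical.

```python
from dataclasses import dataclass

# Illustrative sketch only -- not the MLTE API. Field names follow the
# standard six-part quality attribute (QA) scenario structure; the
# example values are hypothetical.
@dataclass
class QAScenario:
    source: str            # who or what generates the stimulus
    stimulus: str          # the condition the system must respond to
    environment: str       # operational context in which it occurs
    artifact: str          # the part of the system that is stimulated
    response: str          # the required behavior
    response_measure: str  # how the response is evaluated in a test case

latency_scenario = QAScenario(
    source="operator workstation",
    stimulus="single inference request with a 1024x1024 image",
    environment="normal operations on the target edge hardware",
    artifact="deployed object-detection model",
    response="model returns detections",
    response_measure="95th-percentile latency under 200 ms",
)
```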

After negotiation cards have been created, teams start internal model testing (IMT) to test an initial model against baseline performance requirements. Once model performance exceeds the baseline requirements, teams perform system-dependent model testing (SDMT) to determine whether the model will meet the requirements and constraints of operating as part of the larger ML-enabled system. The MLTE tool provides a test catalog containing example test cases that model developers can use as part of SDMT.
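
The distinction between the two stages can be illustrated with a minimal sketch: an IMT-style check compares validation accuracy against a negotiated baseline, while an SDMT-style check exercises a system constraint such as inference latency. The thresholds, model interface, and data below are assumptions for illustration, not code from the MLTE test catalog.

```python
import time
import numpy as np

# Hypothetical stand-ins: any model object with a predict() method and
# any validation data will do. Thresholds would come from the
# negotiation card in practice.

def internal_model_test(model, X_val, y_val, baseline_accuracy=0.80):
    """IMT-style check: does the candidate model meet the agreed baseline?"""
    accuracy = float(np.mean(model.predict(X_val) == y_val))
    return accuracy >= baseline_accuracy, accuracy

def system_dependent_model_test(model, X_sample, max_latency_s=0.2):
    """SDMT-style check: does single-sample inference satisfy a latency
    constraint drawn from a system-level QA scenario?"""
    start = time.perf_counter()
    model.predict(X_sample)
    latency_s = time.perf_counter() - start
    return latency_s <= max_latency_s, latency_s
```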

After teams execute the test cases, they can use the MLTE tool to produce reports that communicate the test results and analysis of the findings. The test cases, test code, and test data can be delivered with the model to support system-level T&E and integration activities.
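
For example, once individual test cases have produced pass/fail outcomes and measured values, they can be rolled up into a single shareable summary. The sketch below shows the general idea only; MLTE generates its own report artifacts, and the structure here is a hypothetical stand-in.

```python
import json
from datetime import datetime, timezone

def build_report(results: dict[str, tuple[bool, float]]) -> str:
    """Roll individual test-case outcomes into one JSON summary."""
    report = {
        "generated": datetime.now(timezone.utc).isoformat(),
        "test_cases": [
            {"name": name, "passed": passed, "measured_value": value}
            for name, (passed, value) in results.items()
        ],
        "all_passed": all(passed for passed, _ in results.values()),
    }
    return json.dumps(report, indent=2)

# Example usage with hypothetical results from IMT and SDMT checks:
print(build_report({"accuracy_vs_baseline": (True, 0.87),
                    "inference_latency_s": (True, 0.12)}))
```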

Overall, both the MLTE process and tool can support a variety of ML projects and are designed to be extensible to suit differing user needs. See this fact sheet for more details on the MLTE process and tool.


Work with MLTE

MLTE is being actively developed and extended as we work with the DoD and other organizations on adopting it as part of their ML-enabled system development processes.

Contact us if you would like help instantiating MLTE within your organization. We welcome feedback through email or GitHub issues.

Learn More

MLTE: System-Centric Test and Evaluation of Machine Learning Models

Software

The SEI’s Machine Learning Test and Evaluation is an open source process and tool to evaluate machine learning models from inception to deployment.

Download

MLTE: Machine Learning Test and Evaluation

Fact Sheet

Machine Learning Test and Evaluation is a comprehensive process and tool for testing machine learning models to help you develop or acquire production-ready models.

Learn More

Improving Machine Learning Test and Evaluation with MLTE

Podcast

Machine learning (ML) models commonly experience issues when integrated into production systems. MLTE provides a process and infrastructure for ML test and evaluation.

Listen

Introducing MLTE: A Systems Approach to Machine Learning Test and Evaluation

Blog Post

Machine learning systems are notoriously difficult to test. This post introduces Machine Learning Test and Evaluation (MLTE), a new process and tool to mitigate this problem and create safer, more reliable systems.

Read

New SEI Tool Enhances Machine Learning Model Test and Evaluation

News Item

Machine Learning Test and Evaluation version 1.0 applies software engineering best practices to ensure ML model development results in production-ready ML models.

Read

Using Quality Attribute Scenarios for ML Model Test Case Generation

Conference Paper

This paper presents an approach based on quality attribute (QA) scenarios to elicit and define system- and model-relevant test cases for ML models.

Read