Bridging the Gap between Requirements Engineering and Model Evaluation in Machine Learning

As the use of artificial intelligence (AI) systems in real-world settings has increased, so has demand for assurances that AI-enabled systems perform as intended. Due to the complexity of modern AI systems, the environments they are deployed in, and the tasks they are designed to complete, providing such guarantees remains a challenge.

Defining and validating system behaviors through requirements engineering (RE) has been an integral component of software engineering since the 1970s. Despite the longevity of this practice, requirements engineering for machine learning (ML) is not standardized and, as evidenced by interviews with ML practitioners and data scientists, is considered one of the hardest tasks in ML development.

In this post, we define a simple evaluation framework centered around validating requirements and demonstrate this framework on an autonomous vehicle example. We hope that this framework will serve as (1) a starting point for practitioners to guide ML model development and (2) a touchpoint between the software engineering and machine learning research communities.

The Gap Between RE and ML

In traditional software systems, evaluation is driven by requirements set by stakeholders, policy, and the needs of different components in the system. Requirements have played a major role in engineering traditional software systems, and processes for their elicitation and validation are active research topics. AI systems are ultimately software systems, so their evaluation should also be guided by requirements.

However, modern ML models, which often lie at the heart of AI systems, pose unique challenges that make defining and validating requirements harder. ML models are characterized by learned, non-deterministic behaviors rather than explicitly coded, deterministic instructions. ML models are thus often opaque to end-users and developers alike, resulting in issues with explainability and the concealment of unintended behaviors. ML models are notorious for their lack of robustness to even small perturbations of inputs, which makes failure modes hard to pinpoint and correct.

Despite rising concerns about the safety of deployed AI systems, the overwhelming focus from the research community when evaluating new ML models is performance on general notions of accuracy and collections of test data. Although this establishes baseline performance in the abstract, these evaluations do not provide concrete evidence about how models will perform for specific, real-world problems. Evaluation methodologies pulled from the state of the art are also often adopted without careful consideration.

Fortunately, work bridging the gap between RE and ML is beginning to emerge. Rahimi et al., for instance, propose a four-step procedure for defining requirements for ML components. This procedure consists of (1) benchmarking the domain, (2) interpreting the domain in the data set, (3) interpreting the domain learned by the ML model, and (4) minding the gap (between the domain and the domain learned by the model). Likewise, Raji et al. present an end-to-end framework from scoping AI systems to performing post-audit activities.

Related research, though not directly about RE, indicates a demand to formalize and standardize RE for ML systems. In the space of safety-critical AI systems, reports such as the Concepts of Design Assurance for Neural Networks (CoDANN) define development processes that include requirements. For medical devices, several methods for requirements engineering in the form of stress testing and performance reporting have been outlined. Similarly, methods from the ML ethics community for formally defining and testing fairness have emerged.

A Framework for Empirically Validating ML Models

Given the gap between evaluations used in ML literature and requirement validation processes from RE, we propose a formal framework for ML requirements validation. In this context, validation is the process of ensuring a system has the functional performance characteristics established by previous stages in requirements engineering prior to deployment.

Defining criteria for determining if an ML model is valid is helpful for deciding that a model is acceptable to use but suggests that model development essentially ends once requirements are fulfilled. Conversely, using a single optimizing metric acknowledges that an ML model will likely be updated throughout its lifespan but provides an overly simplified view of model performance.

The author of Machine Learning Yearning recognizes this tradeoff and introduces the concept of optimizing and satisficing metrics. Satisficing metrics determine levels of performance that a model must achieve before it can be deployed. An optimizing metric can then be used to choose among models that pass the satisficing metrics. In essence, satisficing metrics determine which models are acceptable, and the optimizing metric determines which of the acceptable models is most performant. We build on these ideas below with deeper formalisms and specific definitions.

Model Evaluation Setting

We assume a fairly standard supervised ML model evaluation setting. Let f: X ↦ Y be a model. Let F be a class of models defined by their input and output domains (X and Y, respectively), such that f ∈ F. For instance, F can represent all ImageNet classifiers, and f could be a neural network trained on ImageNet.

To evaluate f, we assume there minimally exists a set of test data D = {(x₁, y₁), …, (x_n, y_n)}, with x_i ∈ X and y_i ∈ Y for all i ∈ [1, n], held out for the sole purpose of evaluating models. There may also optionally exist metadata D' associated with instances or labels, which we denote as x_i' ∈ X' and y_i' ∈ Y' for instance x_i and label y_i, respectively. For example, instance-level metadata may describe sensing (such as angle of the camera to the Earth for satellite imagery) or environment conditions (such as weather conditions in imagery collected for autonomous driving) during observation.

Validation Tests

Moreover, let m:(F×P(D))↦ ℝ be a performance metric, and M be a set of performance metrics, such that m ∈ M. Here, P represents the power set. We define a test to be the application of a metric m on a model f for a subset of test data, resulting in a value called a test result. A test result indicates a measure of performance for a model on a subset of test data according to a specific metric.

In our proposed validation framework, evaluation of models for a given application is defined by a single optimizing test and a set of acceptance tests:

Optimizing Test: An optimizing test is defined by a metric m* that takes D as input. The intent is to choose m* to capture the most general notion of performance over all test data. Optimizing tests are meant to provide a single-number quantitative measure of performance over a broad range of cases represented within the test data. Our definition of optimizing tests is equivalent to the procedures commonly found in much of the ML literature that compare different models, and to how many ML challenge problems are judged.

Acceptance Tests: An acceptance test is meant to define criteria that must be met for a model to achieve the basic performance characteristics derived from requirements analysis.
- Metrics: An acceptance test is defined by a metric m_i with a subset of test data D_i. The metric m_i can be chosen to measure different or more specific notions of performance than the one used in the optimizing test, such as computational efficiency or more specific definitions of accuracy.
- Data sets: Similarly, the data sets used in acceptance tests can be chosen to measure particular characteristics of models. To formalize this selection of data, we define the selection operator for the ith acceptance test as a function σ_i(D, D') = D_i ⊆ D. Here, selection of subsets of testing data is a function of both the testing data itself and optional metadata. This covers cases such as selecting instances of a specific class, selecting instances with common metadata (such as instances pertaining to under-represented populations for fairness evaluation), or selecting challenging instances that were discovered through testing.
- Thresholds: The set of acceptance tests determines if a model is valid, meaning that the model satisfies requirements to an acceptable degree. For this, each acceptance test should have an acceptance threshold γ_i that determines whether a model passes. Using established terminology, a given model passes an acceptance test when the model, along with the corresponding metric and data for the test, produces a result that exceeds (or, for metrics where lower values are better, falls below) the threshold. The exact values of the thresholds should be part of the requirements analysis phase of development and can change based on feedback collected after the initial model evaluation.
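The three ingredients of an acceptance test (a metric m_i, a selection operator σ_i, and a threshold γ_i) can be bundled into one small object. The following is a minimal Python sketch rather than a reference implementation: the class name AcceptanceTest, the higher_is_better flag, and the toy data, metadata, and model are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence, Tuple

Example = Tuple[Any, Any]                               # one (x_i, y_i) pair from D
Model = Callable[[Any], Any]                            # f: X -> Y
Metric = Callable[[Model, Sequence[Example]], float]    # m: F x P(D) -> R
Selector = Callable[[Sequence[Example], Optional[Sequence[Any]]], Sequence[Example]]


@dataclass(frozen=True)
class AcceptanceTest:
    """Hypothetical container for one acceptance test (m_i, sigma_i, gamma_i)."""
    name: str
    metric: Metric            # m_i
    select: Selector          # sigma_i(D, D') = D_i
    threshold: float          # gamma_i
    higher_is_better: bool = True

    def result(self, model, data, metadata=None):
        # Apply the metric to the model on the selected subset D_i.
        return self.metric(model, self.select(data, metadata))

    def passes(self, model, data, metadata=None):
        value = self.result(model, data, metadata)
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Toy illustration: accuracy on the instances whose metadata says "rain".
def accuracy(model, subset):
    return sum(model(x) == y for x, y in subset) / len(subset)

def select_rainy(data, metadata):
    return [example for example, meta in zip(data, metadata)
            if meta["weather"] == "rain"]

def always_odd(x):
    return "odd"

test_data = [(1, "odd"), (2, "even"), (3, "odd"), (4, "even")]
test_metadata = [{"weather": "rain"}, {"weather": "clear"},
                 {"weather": "rain"}, {"weather": "rain"}]

rain_test = AcceptanceTest("accuracy in rain", accuracy, select_rainy, threshold=0.6)
rain_result = rain_test.result(always_odd, test_data, test_metadata)
rain_passed = rain_test.passes(always_odd, test_data, test_metadata)
```

Keeping the selection operator separate from the metric means the same metric can be reused across several acceptance tests that differ only in the subset of test data they examine.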

An optimizing test and a set of acceptance tests should be used jointly for model evaluation. Through development, multiple models are often created, whether they be subsequent versions of a model produced through iterative development or models that are created as alternatives. The acceptance tests determine which models are valid and the optimizing test can then be used to choose from among them.

Moreover, the optimizing test result has the added benefit of being a value that can be tracked through model development. For instance, in the case that a new acceptance test is added that the current best model does not pass, effort may be undertaken to produce a model that does. If new models that pass the new acceptance test significantly lower the optimizing test result, it could be a sign that they are failing at untested edge cases captured in part by the optimizing test.

An Illustrative Example: Object Detection for Autonomous Navigation

To highlight how the proposed framework could be used to empirically validate an ML model, we provide the following example. In this example, we are training a model for visual object detection for use on an automobile platform for autonomous navigation. Broadly, the role of the model in the larger autonomous system is to determine both where objects are (localization) and what they are (classification) in front of the vehicle, given standard RGB visual imagery from a front-facing camera. Inferences from the model are then used in downstream software components to navigate the vehicle safely.

Assumptions

To ground this example further, we make the following assumptions:

- The vehicle is equipped with additional sensors common to autonomous vehicles, such as ultrasonic and radar sensors that are used in tandem with the object detector for navigation.
- The object detector is used as the primary means to detect objects not easily captured by other modalities, such as stop signs and traffic lights, and as a redundancy measure for tasks best suited for other sensing modalities, such as collision avoidance.
- Depth estimation and tracking are performed using another model and/or another sensing modality; the model being validated in this example is then a standard 2D object detector.
- Requirements analysis has been performed prior to model development and resulted in a test data set D spanning multiple driving scenarios and labeled by humans for bounding box and class labels.

Requirements

For this discussion let us consider two high-level requirements:

- For the vehicle to take actions (accelerating, braking, turning, etc.) in a timely manner, the object detector is required to make inferences at a certain speed.
- To be used as a redundancy measure, the object detector must detect pedestrians at a certain accuracy to be determined safe enough for deployment.

Below we go through the exercise of outlining how to translate these requirements into concrete tests. These assumptions and requirements are meant to motivate our example and are not intended to advocate for the requirements or design of any particular autonomous driving system. To realize such a system, extensive requirements analysis and design iteration would need to occur.

Optimizing Test

The most common metric used to assess 2D object detectors is mean average precision (mAP). While implementations of mAP differ, mAP is generally defined as the mean over the average precisions (APs) for a range of different intersection over union (IoU) thresholds. (For more definitions of IoU, AP, and mAP see this blog post.)

As such, mAP is a single-value measurement of the precision/recall tradeoff of the detector under a variety of assumed acceptable thresholds on localization. However, mAP is potentially too general when considering the requirements of specific applications. In many applications, a single IoU threshold is appropriate because it implies an acceptable level of localization for that application.

Let us assume that for this autonomous vehicle application it has been found through external testing that the agent controlling the vehicle can accurately navigate to avoid collisions if objects are localized with IoU greater than 0.75. An appropriate optimizing test metric could then be average precision at an IoU of 0.75 (AP@0.75). Thus, the optimizing test for this model evaluation is AP@0.75(f, D).
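To make the role of the IoU threshold concrete, the sketch below computes IoU for two axis-aligned boxes and labels each detection in a single image as a true or false positive at a threshold of 0.75. It is an illustration under stated assumptions, not the evaluation code behind the results in this post: the box format (x_min, y_min, x_max, y_max), the greedy highest-score-first matching, the inclusive comparison at the threshold, and the function names are all choices made here. Computing AP itself additionally requires building a precision/recall curve from these labels over the whole test set, and implementations differ in how they do that.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def label_detections(detections, ground_truth, iou_threshold=0.75):
    """Greedily match detections to ground-truth boxes, highest score first.

    detections: list of (score, box); ground_truth: list of boxes.
    Returns a list of (score, is_true_positive) in descending score order.
    Each ground-truth box can be matched by at most one detection.
    """
    matched = set()
    labeled = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_index = 0.0, None
        for index, gt_box in enumerate(ground_truth):
            if index in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_index = overlap, index
        if best_index is not None and best_iou >= iou_threshold:
            matched.add(best_index)
            labeled.append((score, True))
        else:
            labeled.append((score, False))
    return labeled


# Toy example: one well-localized detection and one poorly localized one.
ground_truth_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (0, 0, 10, 9)),      # IoU 0.9 with the first box
              (0.8, (22, 20, 32, 30))]   # IoU of about 0.67 with the second box
labels = label_detections(detections, ground_truth_boxes)
```

In this toy case the second detection would count as correct under a looser threshold such as 0.5, which is why fixing the threshold to the level of localization the application needs changes which models look best.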

Acceptance Tests

Assume testing indicated that downstream components in the autonomous system require a consistent stream of inferences at 30 frames per second to react appropriately to driving conditions. To strictly ensure this, we require that each inference takes no longer than 0.033 seconds (approximately 1/30 of a second). While such a test should not vary considerably from one instance to the next, one could still evaluate inference time over all test data, resulting in the acceptance test max_{x∈D} inference_time(f(x)) ≤ 0.033, to ensure no irregularities in the inference procedure.
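A maximum-latency test of this kind could be measured along the following lines. This is a rough sketch: the function names, the warm-up step, and the stand-in model are assumptions made for illustration, and wall-clock timing of a real detector would need more care (for example, waiting for asynchronous accelerator work to finish before stopping the clock, and measuring on the deployment hardware).

```python
import time


def max_inference_time(model, inputs, warmup=1):
    """Return the slowest single-input wall-clock inference time in seconds."""
    # Run a few untimed calls first so one-time setup costs are not counted.
    for x in inputs[:warmup]:
        model(x)
    worst = 0.0
    for x in inputs:
        start = time.perf_counter()
        model(x)
        worst = max(worst, time.perf_counter() - start)
    return worst


def meets_latency_budget(worst_time, budget_seconds=0.033):
    """Acceptance test: the slowest inference must fit within the budget."""
    return worst_time <= budget_seconds


# Stand-in for a detector: any callable that maps an input to an output.
def dummy_model(x):
    return x * 2

worst = max_inference_time(dummy_model, list(range(100)))
ok = meets_latency_budget(worst)
```

Using the maximum rather than the mean matches the strict reading of the requirement: a single slow frame is enough to fail the test.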

An acceptance test to determine sufficient performance on pedestrians begins with selecting appropriate instances. For this we define the selection operator σ_ped(D) = {(x, y) ∈ D | y = pedestrian}. Selecting a metric and a threshold for this test is less straightforward. Let us assume for the sake of this example that it was determined that the object detector should successfully detect 75 percent of all pedestrians for the system to achieve safe driving, because other systems are the primary means for avoiding pedestrians (this is likely an unrealistically low percentage, but we use it in the example to strike a balance between models compared in the next section).

This approach implies that the pedestrian acceptance test should ensure a recall of 0.75. However, it’s possible for a model to attain high recall by producing many false positive pedestrian inferences. If downstream components are constantly alerted that pedestrians are in the path of the vehicle, and fail to reject false positives, the vehicle could apply brakes, swerve, or stop completely at inappropriate times.

Consequently, an appropriate metric for this case should ensure that acceptable models achieve 0.75 recall with sufficiently high pedestrian precision. To this end, we can use the precision@0.75 metric, which measures the precision of a model when it achieves 0.75 recall. Assume that other sensing modalities and tracking algorithms can be employed to safely reject a portion of false positives and that, consequently, a precision of 0.5 is sufficient. As a result, we employ the acceptance test precision@0.75(f, σ_ped(D)) ≥ 0.5.
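One way to compute precision at a fixed recall is to sweep through detections from most to least confident and look at the operating points where recall has reached the target. The sketch below assumes the pedestrian detections have already been labeled as true or false positives; the function name, the input format, the choice to report the best precision among qualifying operating points, and the value 0.0 returned when the target recall is never reached are conventions chosen here, and other implementations may differ.

```python
def precision_at_recall(scored_hits, num_ground_truth, target_recall=0.75):
    """Best precision among operating points whose recall meets the target.

    scored_hits: list of (confidence, is_true_positive) for one class.
    num_ground_truth: number of labeled objects of that class in the test data.
    Returns 0.0 if the model never reaches the target recall.
    """
    if num_ground_truth == 0:
        return 0.0
    best_precision = 0.0
    true_positives = 0
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        true_positives += int(is_true_positive)
        recall = true_positives / num_ground_truth
        if recall >= target_recall:
            # Precision when keeping only the top `rank` detections.
            best_precision = max(best_precision, true_positives / rank)
    return best_precision


# Toy example with four labeled pedestrians and six pedestrian detections.
pedestrian_hits = [(0.95, True), (0.90, True), (0.80, False),
                   (0.70, True), (0.60, False), (0.50, True)]
value = precision_at_recall(pedestrian_hits, num_ground_truth=4)
passes_pedestrian_test = value >= 0.5
```

In the toy data, recall first reaches 0.75 at the fourth detection, where three of the four detections kept are correct, so the metric is 0.75 and the hypothetical model would pass the 0.5 threshold.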

Model Validation Example

To further develop our example, we performed a small-scale empirical validation of three models trained on the Berkeley Deep Drive (BDD) dataset. BDD contains imagery taken from a car-mounted camera while it was driven on roadways in the United States. Images were labeled with bounding boxes and classes of 10 different objects including a “pedestrian” class.

We then evaluated three object-detection models according to the optimizing test and two acceptance tests outlined above. All three models used the RetinaNet meta-architecture and focal loss for training. Each model uses a different backbone architecture for feature extraction. These three backbones represent different options for an important design decision when building an object detector:

- The MobileNetv2 model: the first model used a MobileNetv2 backbone. MobileNetv2 is the simplest network of these three architectures and is known for its efficiency. Code for this model was adapted from this GitHub repository.
- The ResNet50 model: the second model used a 50-layer residual network (ResNet). ResNet lies somewhere between the first and third model in terms of efficiency and complexity. Code for this model was adapted from this GitHub repository.
- The Swin-T model: the third model used a Swin-T transformer. The Swin-T transformer represents the state of the art in neural network architecture design but is architecturally complex. Code for this model was adapted from this GitHub repository.

Each backbone was adapted to be a feature pyramid network as done in the original RetinaNet paper, with connections from the bottom-up to the top-down pathway occurring at the 2nd, 3rd, and 4th stage for each backbone. Default hyper-parameters were used during training.

Test	Threshold	MobileNetv2	ResNet50	Swin-T
AP@0.75	(Optimizing)	0.105	0.245	0.304 (best)
max inference_time	≤ 0.033	0.0200 (pass)	0.0233 (pass)	0.0360 (fail)
precision@0.75 (pedestrians)	≥ 0.5	0.103 (fail)	0.598 (pass)	0.730 (pass)

Table 1: Results from empirical evaluation example. Each row is a different test across models. Acceptance test thresholds are given in the second column. The best-performing model in the optimizing test row is marked (best). Values in the acceptance test rows are marked (pass) or (fail).

Table 1 shows the results of our validation testing. These results do not represent the best selection of hyperparameters, as default values were used. We do note, however, that the Swin-T model achieved a COCO mAP of 0.321, which is comparable to some recently published results on BDD.

The Swin-T model had the best overall AP@0.75. If this single optimizing metric were used to determine which model is the best for deployment, then the Swin-T model would be selected. However, the Swin-T model's maximum inference time (0.0360 seconds) exceeded the 0.033-second limit set by the inference time acceptance test. Because a minimum inference speed is an explicit requirement for our application, the Swin-T model is not a valid model for deployment. Similarly, while the MobileNetv2 model performed inference most quickly among the three, it did not achieve sufficient precision@0.75 on the pedestrian class to pass the pedestrian acceptance test. The only model to pass both acceptance tests was the ResNet50 model.
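The selection logic described above (filter by acceptance tests, then rank by the optimizing test) can be written in a few lines. The sketch below uses the values reported in Table 1, rounded to three decimals for the pedestrian test; the dictionary keys and variable names are hypothetical.

```python
# Test results per model, taken from Table 1.
results = {
    "MobileNetv2": {"ap_at_075": 0.105, "max_inference_time": 0.0200,
                    "pedestrian_precision_at_075": 0.103},
    "ResNet50":    {"ap_at_075": 0.245, "max_inference_time": 0.0233,
                    "pedestrian_precision_at_075": 0.598},
    "Swin-T":      {"ap_at_075": 0.304, "max_inference_time": 0.0360,
                    "pedestrian_precision_at_075": 0.730},
}

# Each acceptance test maps a test result to pass (True) or fail (False).
acceptance_tests = {
    "max_inference_time": lambda seconds: seconds <= 0.033,
    "pedestrian_precision_at_075": lambda precision: precision >= 0.5,
}

# Step 1: keep only the models that pass every acceptance test.
valid_models = [
    name for name, scores in results.items()
    if all(check(scores[test]) for test, check in acceptance_tests.items())
]

# Step 2: among valid models, choose the best optimizing test result.
selected = (max(valid_models, key=lambda name: results[name]["ap_at_075"])
            if valid_models else None)

# Ranking by the optimizing test alone ignores the acceptance tests.
best_by_optimizing_only = max(results, key=lambda name: results[name]["ap_at_075"])
```

Here valid_models contains only ResNet50, so it is selected, while ranking by AP@0.75 alone would have chosen Swin-T.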

Given these results, there are several possible next steps. If there are additional resources for model development, one or more of the models can be iterated on. The ResNet50 model did not achieve the highest AP@0.75, so additional performance could be gained through a more thorough hyperparameter search or training with additional data sources. Similarly, the MobileNetv2 model might be attractive because of its high inference speed, and similar steps could be taken to improve its performance to an acceptable level.

The Swin-T model could also be a candidate for iteration because it had the best performance on the optimizing test. Developers could investigate ways of making their implementation more efficient, thus increasing inference speed. Even if additional model development is not undertaken, since the ResNet50 model passed all acceptance tests, the development team could proceed with the model and end model development until further requirements are discovered.

Future Work: Studying Other Evaluation Methodologies

There are several important topics not covered in this work that require further investigation. First, we believe that models deemed valid by our framework can greatly benefit from other evaluation methodologies, which require further study. Requirements validation is only powerful if requirements are known and can be tested. Allowing for more open-ended auditing of models, such as adversarial probing by a red team of testers, can reveal unexpected failure modes, inequities, and other shortcomings that can become requirements.

In addition, most ML models are components in a larger system. Testing the influence of model choices on the larger system is an important part of understanding how the system performs. System-level testing can reveal functional requirements that can be translated into acceptance tests of the form we proposed, but it may also lead to more sophisticated acceptance tests that include other system components.

Second, our framework could also benefit from analysis of confidence in results, such as is common in statistical hypothesis testing. Work that produces practically applicable methods that specify sufficient conditions, such as amount of test data, in which one can confidently and empirically validate a requirement of a model would make validation within our framework considerably stronger.
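As one illustration of what such a sufficient condition could look like, Hoeffding's inequality gives a conservative bound on how many independent test examples are needed for an average of per-example outcomes bounded in [0, 1] (for instance, the fraction of pedestrians detected) to lie within ε of its true value with probability at least 1 − δ. This is a standard statistical bound offered here as a sketch, not a method proposed in this post; it assumes independent, identically distributed test examples and does not directly apply to metrics such as AP or precision at a fixed recall. The function name is hypothetical.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Smallest n with 2 * exp(-2 * n * epsilon**2) <= delta.

    epsilon: tolerated gap between the measured and true average.
    delta: tolerated probability that the gap is exceeded.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


# Test examples needed to pin an average down to +/- 0.05 with 95% confidence.
n_required = hoeffding_sample_size(epsilon=0.05, delta=0.05)
```

Because the bound makes no assumptions about the distribution beyond boundedness, it tends to ask for more data than tighter, metric-specific analyses would.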

Third, our work makes strong assumptions about the process outside of the validation of requirements itself, namely that requirements can be elicited and translated into tests. Understanding the iterative process of eliciting requirements, validating them, and performing further testing activities to derive more requirements is vital to realizing requirements engineering for ML.

Conclusion: Building Robust AI Systems

The emergence of standards for ML requirements engineering is a critical effort towards helping developers meet rising demands for effective, safe, and robust AI systems. In this post, we outline a simple framework for empirically validating requirements in machine learning models. This framework couples a single optimizing test with several acceptance tests. We demonstrate how an empirical validation procedure can be designed using our framework through a simple autonomous navigation example and highlight how specific acceptance tests can affect the choice of model based on explicit requirements.

While the basic ideas presented in this work are strongly influenced by prior work in both the machine learning and requirements engineering communities, we believe outlining a validation framework in this way brings the two communities closer together. We invite these communities to try using this framework and to continue investigating the ways that requirements elicitation, formalization, and validation can support the creation of dependable ML systems designed for real-world deployment.

Software Engineering Institute

SEI Blog