Measurement Challenges in Software Assurance and Supply Chain Risk Management

Software supply chain risk has grown sharply since 2009, when the perpetrators of the Heartland Payment Systems breach reaped 100 million debit and credit card numbers. Subsequent events in 2020 and 2021, such as SolarWinds and Log4j, show that the scale of disruption from a third-party software supplier can be massive. In 2023, the MOVEit vulnerability compromised the information of 1.6 million individuals and cost businesses more than $9.9 billion. Part of this risk can be ascribed to software reuse, which has enabled faster fielding of systems but which can also introduce vulnerabilities. A recent report by SecurityScorecard found that 98 percent of the 230,000 organizations it sampled have had third-party software components breached within the prior two years.

Limitations in measuring software assurance directly impact the ability of organizations to address software assurance across the lifecycle. Leadership throughout the supply chain continues to underinvest in software assurance, especially early in the lifecycle. Consequently, design decisions tend to lock in weaknesses because there is no means to characterize and measure acceptable risk. This SEI Blog post examines the current state of measurement in the area of software assurance and supply chain management, with a particular focus on open source software, and highlights some promising measurement approaches.

Measurement in the Supply Chain

In the current environment, suppliers rush to deliver new features to motivate buyers. This rush, however, comes at the expense of time spent analyzing the code to remove potential vulnerabilities. Too often, buyers have limited means to evaluate the risk in products they acquire. Even if a supplier addresses an identified vulnerability quickly and issues a patch, it is up to the users of that software to apply the fix. Software supply chains are many levels deep, and too frequently the patches apply to products buried deep within a chain. In one example from an open source software project, we counted just over 3,600 unique software component dependencies traversing nearly 35 levels deep (that is, ‘a’ depends on ‘b’, which depends on ‘c’, and so on). Each layer must apply the patch and send an update up the chain. This can be a slow and error-prone process, because those higher in the chain have limited knowledge of where each specific product has been used. Recent mandates to create software bills of materials (SBOMs) are an attempt to improve visibility, but the fix still needs to be applied by each of the many layers that contain the vulnerability.
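The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the project mentioned: the toy graph and the function names `dependency_levels` and `must_update` are hypothetical, and a component's level is taken here to be its shortest distance from the root.

```python
from collections import deque

def dependency_levels(graph, root):
    """Breadth-first walk mapping each transitive dependency of `root`
    to its level (shortest number of hops below the root)."""
    levels = {}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                levels[dep] = depth + 1
                queue.append((dep, depth + 1))
    return levels

def must_update(graph, patched):
    """Every component that directly or transitively depends on `patched`,
    i.e. every layer that has to pick up the fix and republish."""
    reverse = {}
    for parent, deps in graph.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(parent)
    return set(dependency_levels(reverse, patched))

# Hypothetical toy graph: component -> direct dependencies.
graph = {
    "app": ["a", "x"],
    "a": ["b"],
    "b": ["c"],
    "x": ["c"],
    "c": [],
}

levels = dependency_levels(graph, "app")
unique_count = len(levels)          # number of unique transitive dependencies
max_depth = max(levels.values())    # deepest level reached
affected = must_update(graph, "c")  # who must act when 'c' is patched
```

Even in this toy graph, a fix to `c` requires action from every other component, which is the propagation problem an SBOM makes visible but does not solve.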

The Open Source Security Foundation (OSSF) Scorecard incorporates a set of metrics that can be applied to an open source software project. Project attributes that OSSF believes contribute to a more secure open source application are measured and combined, using a weighted approach, into a single score.

From a metrics perspective, there are limitations to this approach:

The open source community is driving and evolving which items to measure and, therefore, what to build into the tool.
The relative importance of each factor is also built into the tool, which makes it difficult (but not impossible) to tailor the results to specific, custom, end-user needs.
Many of the items measured in the tool appear to be self-reported by the developer(s) rather than validated by a third party, although self-reporting is a common attribute of open source projects.

Other tools, such as MITRE’s Hipcheck, have the same limitations. For an open source project, it is possible to get a score for the project using Scorecard and scores for the individual dependency projects, but questions arise from this approach. How do those individual scores roll up into the overall score? Do you pick the lowest score across all the dependencies, or do you apply some sort of weighted average of scores? Furthermore, a recent research paper indicated that open source projects that score highly on Scorecard might, in fact, produce packages that have more reported vulnerabilities. Issues such as these indicate that further study is needed.
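To make the roll-up question concrete, the sketch below contrasts the two options mentioned: taking the lowest score and taking a weighted average. The scores, the weights, and the function names are hypothetical, and neither function is claimed to be how Scorecard or Hipcheck aggregates results.

```python
def rollup_min(scores):
    """Weakest-link roll-up: the overall score is the lowest individual score."""
    return min(scores.values())

def rollup_weighted(scores, weights):
    """Weighted-average roll-up; `weights` must have an entry for every score."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical scores (assumed 0 to 10 scale) for a project and two dependencies.
scores = {"project": 8.0, "dep_a": 9.0, "dep_b": 3.0}
# Hypothetical weights: the project itself counts twice as much as each dependency.
weights = {"project": 2.0, "dep_a": 1.0, "dep_b": 1.0}

lowest = rollup_min(scores)
weighted = rollup_weighted(scores, weights)
```

With these made-up inputs the two roll-ups disagree sharply (3.0 versus 7.0), which shows why the choice of aggregation rule needs to be explicit and justified.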

Measuring Software Cybersecurity Risk: State of the Practice

Currently, it is possible to collect vast amounts of data related to cybersecurity in general. We can also measure specific product characteristics related to cybersecurity. However, while much of the data collected reflects the results of an attack, whether attempted or successful, data on earlier security lifecycle activities often is not diligently collected, nor is it analyzed as thoroughly as data from later points in the lifecycle.

As software engineers, we believe that improved software practices and processes will result in a more robust and secure product. However, which specific practices and processes actually result in a more secure product? There can be quite a bit of elapsed time between the implementation of improved processes and practices and the subsequent deployment of the product. If the product is not successfully attacked, does it mean that it’s more secure?

Certainly, government contractors have a profit motive that justifies meeting the cybersecurity policy requirements that apply to them, but do they know how to measure the cybersecurity risk of their products? And how would they know whether it has improved sufficiently? For open source software, when developers are not compensated, what would motivate them to do this? Why would they even care whether a particular organization—be it academic, industry, or government—is motivated to use their product?

Measuring Software Cybersecurity Risk: Currently Available Metrics

The SEI led a research effort to identify the metrics currently available within the lifecycle that could be used to provide indicators of potential cybersecurity risk. From an acquisition lifecycle perspective, there are two critical questions to be addressed:

Is the acquisition headed in the right direction as it is engineered and built (predictive)?
Is the implementation sustaining an acceptable level of operational assurance (reactive)?

As development shifts further into Agile increments, many of which include third-party and open source components, different tools and definitions are applied to collecting defects. Consequently, the meaning of defect counts as a metric for predicting risk becomes obscured.

Highly vulnerable components implemented using effective and well-managed zero trust principles can deliver acceptable operational risk. Likewise, well-constructed, high-quality components with weak interfaces can be highly prone to successful attacks. Operational context is critical to the risk exposure. A simple evaluation of each potential vulnerability using something like a Common Vulnerability Scoring System (CVSS) score can be extremely misleading since the score without the context has limited value in determining actual risk.
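The effect of operational context can be illustrated with a small sketch. The multipliers below are invented for illustration only. They are not part of CVSS, which defines its own environmental metrics that this sketch does not implement, and the `Finding` fields and `contextual_priority` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    component: str
    cvss_base: float             # base score, 0.0 to 10.0
    internet_exposed: bool       # reachable from outside the organization?
    compensating_controls: bool  # e.g., well-managed zero trust enforcement

def contextual_priority(finding):
    """Scale the base score by operational context (illustrative factors only)."""
    score = finding.cvss_base
    if not finding.internet_exposed:
        score *= 0.5  # assumed reduction for internal-only components
    if finding.compensating_controls:
        score *= 0.5  # assumed reduction for effective controls
    return round(score, 1)

# A high-scoring flaw behind strong controls versus a lower-scoring flaw
# in an exposed component with a weak interface.
shielded = Finding("legacy-lib", 8.0, internet_exposed=False, compensating_controls=True)
exposed = Finding("api-gateway", 6.0, internet_exposed=True, compensating_controls=False)

shielded_priority = contextual_priority(shielded)
exposed_priority = contextual_priority(exposed)
```

Under these assumptions the finding with the lower base score ends up with the higher priority, which is the kind of reversal that a score taken without context would hide.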

However, the lack of visibility into the development processes and methods used to develop third-party software, particularly open source software, means that measures related to the processes used and the errors found prior to deployment, if they exist, are not available to inform our understanding of the product. This lack of visibility into product resilience as it relates to the process used to develop it means that we do not have a full picture of the risks, nor do we know whether the processes used to develop the product have been effective. It’s difficult, if not impossible, to measure what is not visible.

Measurement Frameworks Applied to Cybersecurity

Early software measurement was basically concerned with tracking tangible items that provided immediate feedback, such as lines of code or function points. Consequently, many different ways of measuring code size were developed.

Eventually, researchers considered code quality measures. Complexity measures were used to predict code quality. Bug counts in trouble reports, errors found during inspection, and mean time between failures drove some measurement efforts. Through this work, evidence surfaced that suggested it was less costly to locate and correct errors early in the software lifecycle rather than later. However, convincing development managers to spend more money upfront was a tough sell given that their performance evaluations heavily relied on containing development costs.

A few dedicated researchers tracked the measurement results over a long period of time. Basili and Rombach’s seminal work in measurement resulted in the Goal-Question-Metric (GQM) method for helping managers of software projects decide what measurement data would be useful to them. Building on this seminal work, the SEI created the Goal, Question, Indicator, Metric (GQIM) method. In the GQIM, indicators identify information needed to answer each question. Then, in turn, metrics are identified that use the indicators to answer the question. This additional step reminds stakeholders of the practical aspects of data collection and provides a way of ensuring that the needed data is collected for the selected metrics. This method has already been applied by both civilian and military stakeholders.
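The goal, question, indicator, metric chain lends itself to a simple data structure. The sketch below is one possible representation, not an SEI artifact: the class names, the `uncollected_inputs` check, and the example goal are all hypothetical. The check flags metrics that depend on an indicator nobody identified, which is the data collection gap the indicator step is meant to expose.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str
    uses: list  # names of the indicators this metric is computed from

@dataclass
class Question:
    text: str
    indicators: list = field(default_factory=list)  # information needed to answer
    metrics: list = field(default_factory=list)

@dataclass
class Goal:
    statement: str
    questions: list = field(default_factory=list)

def uncollected_inputs(goal):
    """Return (metric, indicator) pairs where a metric relies on an
    indicator that was never identified for its question."""
    gaps = []
    for question in goal.questions:
        known = set(question.indicators)
        for metric in question.metrics:
            for indicator in metric.uses:
                if indicator not in known:
                    gaps.append((metric.name, indicator))
    return gaps

# Hypothetical example goal.
goal = Goal(
    statement="Reduce cybersecurity risk introduced by third-party components",
    questions=[
        Question(
            text="Are known vulnerabilities in dependencies patched promptly?",
            indicators=["patch release date", "patch applied date"],
            metrics=[
                Metric("mean days to apply patch",
                       ["patch release date", "patch applied date"]),
                Metric("patches skipped per release", ["skipped patch log"]),
            ],
        )
    ],
)

gaps = uncollected_inputs(goal)
```

Here the second metric is flagged because no one planned to collect a skipped patch log, so the metric could not be computed in practice.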

Similar data has been collected for cybersecurity, and it shows that it is less costly to correct errors that might lead to vulnerabilities early in the lifecycle rather than later, when software is operational. The results of those studies help answer questions about development cost and reinforce the importance of using good development processes. In that regard, those results support our intuition. For open source software, however, when the development process is not visible, we have no process data from which to draw such conclusions. Furthermore, even when we know something about the development process, the total cost associated with a vulnerability after software is operational can range from zero (if it is never found and exploited) to millions of dollars.

Over the history of software engineering, we have learned that we need software metrics for both the process and the product. This is no different in the case of the cybersecurity of open source software. We must be able to measure the processes for developing and using software and how those measurement results affect the product’s cybersecurity. It is insufficient to measure only operational code, its vulnerabilities, and the attendant risk of successful hacks. In addition, success hinges on a collaborative, unbiased effort that allows multiple organizations to participate under a suitable umbrella.

Primary Buyers Versus Third-Party Buyers

Three cases apply when software is acquired rather than developed in house:

Acquirers of custom contract software can require that the contractor provide visibility into both their development practices and their SCRM plan.
Acquirers can specify the requirements, but the development process is not visible to them, and they have little say over what occurs in that process.
The software product already exists, and the buyer is typically just purchasing a license. The code for the product may or may not be visible, further limiting what can be measured. The product could also, in turn, contain code developed further down in the supply chain, thus complicating the measurement process.

Open source software resembles the third case. The code is visible, but the process used to develop it is invisible unless the developers choose to describe it. The value of having this description depends on the acquirer’s ability to determine what is good versus poor quality code, what is a good development process, and what is a good quality assurance process.

Today, many U.S. government contracts require the supplier to have an acceptable SCRM plan, the effectiveness of which can presumably be measured. Nevertheless, a deep supply chain, with many levels of buyers and dependencies, is clearly a concern. First, you have to know what is in the chain; then you have to have a way of measuring each component; and finally you need trustworthy algorithms to produce a bottom-line set of measurements for the final product constructed from a chain of products. Note that when a DoD supplier also incorporates other proprietary or open source software, that supplier becomes an acquirer and is beset with the same challenges as a third-party buyer.

Measuring the risks associated with the attack surface of the ultimate product is helpful, but only if you can determine what the attack surface is. With open source, if the build picks up the latest version of a component, the measurement process should be revisited to ensure you still have a valid bottom-line number. However, this approach presents a number of questions:

Is measurement being done?
How effective is the measurement process and its results?
Is measurement repeated every time a component in the product/build changes?
Do you even know when a component in the product/build changes?
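The last two questions can be partly automated by comparing the component inventory of successive builds. The sketch below assumes each build can be summarized as a mapping from component name to version, for example as extracted from an SBOM or lock file. The function names and sample data are hypothetical.

```python
def changed_components(previous, current):
    """Compare two {component: version} maps and report what changed."""
    added = {name: ver for name, ver in current.items() if name not in previous}
    removed = {name: ver for name, ver in previous.items() if name not in current}
    updated = {
        name: (previous[name], current[name])
        for name in previous.keys() & current.keys()
        if previous[name] != current[name]
    }
    return added, removed, updated

def needs_remeasurement(previous, current):
    """True if any component was added, removed, or changed version."""
    return any(changed_components(previous, current))

# Hypothetical inventories from two consecutive builds.
build_1 = {"libfoo": "1.2.0", "libbar": "4.0.1"}
build_2 = {"libfoo": "1.3.0", "libbar": "4.0.1", "libbaz": "0.9.0"}

added, removed, updated = changed_components(build_1, build_2)
remeasure = needs_remeasurement(build_1, build_2)
```

A check like this only answers whether something changed. Deciding how much of the measurement must be repeated for a given change remains a judgment call.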

Examples of Potentially Useful Measures

An extensive three-year study of security testing and analysis by Synopsys revealed that 92 percent of tests discovered vulnerabilities in the applications being tested. Despite showing improvement year over year, the numbers still present a grim picture of the current state of affairs. In this study, improvements in open source software appeared to result from improved development processes, including inspection and testing. However, older open source software that is no longer maintained still exists in some libraries, and it can be downloaded without those corresponding improvements.

This study and others indicate that the community has started making progress in this area by defining measures that go beyond identifying vulnerabilities in open source software, while keeping in mind that the goal is to reduce vulnerabilities. Measures that are effective in SCRM are relevant to open source software. An SEI technical note discusses how the Software Assurance Framework (SAF) illustrates promising metrics for specific activities. The note presents Table 1 below, which pertains to SAF Practice Area 2.4 Program Risk Management and addresses the question, “Does the program manage program-level cybersecurity risks?”

Table 1: Promising Metrics for Specific Software Assurance Activities

| Activities/Practices | Outputs | Candidate Metrics |
| --- | --- | --- |
| Ensure that project strategies and plans address project-level cybersecurity risks (e.g., program risks related to cybersecurity resources and funding). | Program Plan; Technology Development Strategy (TDS); Analysis of Alternatives (AoA) | % program managers receiving cybersecurity risk training; % programs with cybersecurity related risk management plans |
| Identify and manage project-level cybersecurity risks (e.g., program risks related to cybersecurity resources and funding). | Risk Management Plan; Risk Repository | % programs with cybersecurity related risks; # cybersecurity related risks tracked per month |
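The candidate metrics in Table 1 are simple proportions and counts, so they are easy to compute once the underlying records exist. The sketch below shows one way to do that. The record layout, field names, and sample data are hypothetical, not taken from the SEI technical note.

```python
def percent(records, predicate):
    """Percentage of records for which `predicate` is true."""
    return 100.0 * sum(1 for r in records if predicate(r)) / len(records)

# Hypothetical portfolio: one record per program.
programs = [
    {"name": "P1", "pm_trained": True,  "has_risk_plan": True,  "risks_tracked": [3, 4, 5]},
    {"name": "P2", "pm_trained": True,  "has_risk_plan": False, "risks_tracked": []},
    {"name": "P3", "pm_trained": False, "has_risk_plan": True,  "risks_tracked": [1, 1, 2]},
    {"name": "P4", "pm_trained": True,  "has_risk_plan": False, "risks_tracked": []},
]

# % program managers receiving cybersecurity risk training
pct_trained = percent(programs, lambda p: p["pm_trained"])
# % programs with cybersecurity related risk management plans
pct_with_plan = percent(programs, lambda p: p["has_risk_plan"])
# % programs with cybersecurity related risks (at least one tracked risk)
pct_with_risks = percent(programs, lambda p: any(p["risks_tracked"]))
# number of cybersecurity related risks tracked per month, summed across programs
risks_per_month = [sum(month) for month in zip(
    *(p["risks_tracked"] for p in programs if p["risks_tracked"])
)]
```

The arithmetic is trivial. The hard part, as the surrounding discussion notes, is getting consistent, trustworthy records to feed it.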

The Emerging Need for Software Assurance Metrics Standards

Once we understand all the metrics needed to predict cybersecurity in open source software, we will need standards that make it easier to apply these metrics to open source and other software in the supply chain. Providers could consider shipping software products with metrics that help users understand the product’s cybersecurity posture. As an example, at the operational level, the Vulnerability Exploitability eXchange (VEX) helps users understand whether or not a particular product is affected by a specific vulnerability. Such publicly available information can help users make choices about open source and other products in the supply chain. Of course, this is just one example of how data might be collected and used, and it focuses on vulnerabilities in existing software.
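As a rough illustration of how VEX-style information might be consumed, the sketch below filters scanner findings using status statements. The four status labels follow those commonly used in VEX documents. The flat record layout, the product names, the placeholder vulnerability identifiers, and the `needs_attention` function are simplifying assumptions and do not follow any specific VEX schema.

```python
# Status labels commonly used in VEX documents.
VEX_STATUSES = {"not_affected", "affected", "fixed", "under_investigation"}

def needs_attention(findings, vex_statements):
    """Keep findings that no VEX statement marks as not_affected or fixed.
    Findings without any statement are kept and labeled 'unknown'."""
    status_by_key = {
        (s["product"], s["vulnerability"]): s["status"] for s in vex_statements
    }
    remaining = []
    for product, vulnerability in findings:
        status = status_by_key.get((product, vulnerability), "unknown")
        if status not in ("not_affected", "fixed"):
            remaining.append((product, vulnerability, status))
    return remaining

# Placeholder identifiers, not real vulnerabilities.
findings = [
    ("widget-server", "VULN-0001"),
    ("widget-server", "VULN-0002"),
    ("widget-client", "VULN-0003"),
]
vex_statements = [
    {"product": "widget-server", "vulnerability": "VULN-0001", "status": "not_affected"},
    {"product": "widget-server", "vulnerability": "VULN-0002", "status": "affected"},
]

todo = needs_attention(findings, vex_statements)
```

In this example, one finding is ruled out by the supplier's statement, one is confirmed, and one has no statement at all, which is itself useful information for the user.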

Similar standard ways of documenting and reporting cybersecurity risk are needed throughout the software product development process. One of the challenges that we have faced in analyzing data is that it may not be collected or documented in a standard way. Reports are often written in unstructured prose that is not amenable to analysis, even when the reports are scanned, searched for keywords and phrases, and analyzed in a standard way. When reports are written in a non-standard way, analyzing the content to achieve consistent results is challenging.

We have provided some examples of potentially useful metrics, but data collection and analysis will be needed to validate that they are, in fact, useful in the supply chains that include open source software. This validation requires standards that support data collection and analysis methods and evidence that affirms the usefulness of a specific method. Such evidence may start with case studies, but these need to be reinforced over time with numerous examples that clearly demonstrate the utility of the metrics in terms of fewer hacks, reduced expenditure of time and money over the life of a product, enhanced organizational reputation, and other measures of value.

New metrics that have not yet been postulated must also be developed. Some research papers may describe novel metrics along with a case study or two. However, the massive amount of data collection and analysis needed to truly have confidence in these metrics seldom happens. New metrics either fall by the wayside or are adopted willy-nilly because renowned researchers and influential organizations endorse them, whether or not there is sufficient evidence to support their use. We believe that defining metrics, collecting and analyzing data to illustrate their utility, and using standard methods require unbiased, collaborative work if the desired results are to come to fruition.
