Detection and Repair: The Cost of Remediation
Bringing an existing codebase into compliance with the SEI CERT Coding Standard requires an investment of time and effort. The typical way of assessing this cost is to run a static analysis tool on the codebase (noting that installing and maintaining the static analysis tool may incur its own costs). A simple metric for estimating this cost is therefore to count the number of static analysis alerts that report a violation of the CERT guidelines. (This assumes that fixing any one alert typically has no impact on other alerts, though often a single issue may trigger multiple alerts.) But those who are familiar with static analysis tools know that the alerts are not always reliable: there are false positives that must be detected and disregarded. Moreover, violations of some guidelines are inherently easier to detect than violations of others.
This year, we plan on making some exciting updates to the SEI CERT C Coding Standard. This blog post describes one of our ideas for improving the standard. The change would update the standard to better harmonize with the current state of the art in static analysis tools and to simplify the process of source code security auditing.
For this post, we are asking our readers and users to provide us with feedback. Would the changes that we propose to our Risk Assessment metric disrupt your work? How much effort would they impose on you, our readers? If you would like to comment, please send an email to info@sei.cmu.edu.
The premise for our changes is that some violations are easier to repair than others. In the SEI CERT Coding Standard, we assign each guideline a Remediation Cost metric, which is defined by the following table:
| Value | Meaning | Detection | Correction |
|-------|---------|-----------|------------|
| 1 | High | Manual | Manual |
| 2 | Medium | Automatic | Manual |
| 3 | Low | Automatic | Automatic |
Furthermore, each guideline also has a Priority metric, which is the product of the Remediation Cost and two other metrics that assess severity (how consequential is it not to comply with the rule?) and likelihood (how likely is it that violating the guideline leads to an exploitable vulnerability?). All three metrics are represented as numbers ranging from 1 to 3, so their product ranges from 1 to 27 (that is, 3*3*3); for the Remediation Cost, lower numbers indicate greater cost.
The above table could alternatively be represented this way:
| Is Automatically... | Not Repairable | Repairable |
|---------------------|----------------|------------|
| Not Detectable | 1 (High) | 1 (High) |
| Detectable | 2 (Medium) | 3 (Low) |
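To illustrate how these numbers combine, here is a minimal sketch; the enum and function names are ours, and the metric values in main() are invented for illustration.

```c
#include <stdio.h>

/* Old Remediation Cost values from the table above:
 * 1 = High (manual detection, manual correction),
 * 2 = Medium (automatic detection, manual correction),
 * 3 = Low (automatic detection, automatic correction). */
enum remediation_cost { REM_HIGH = 1, REM_MEDIUM = 2, REM_LOW = 3 };

/* Priority is the product of severity, likelihood, and remediation
 * cost, each ranging from 1 to 3, so the product ranges from 1 to 27. */
static int priority(int severity, int likelihood, enum remediation_cost rem) {
    return severity * likelihood * rem;
}

int main(void) {
    /* A hypothetical rule with severity 3, likelihood 2, and a Medium
     * remediation cost has priority 3 * 2 * 2 = 12. */
    printf("Priority: %d\n", priority(3, 2, REM_MEDIUM));
    return 0;
}
```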
This Remediation Cost metric was conceived back in 2006 when the SEI CERT C Coding Standard was first created. We did not use more precise definitions of detectable or repairable at the time. But we did assume that some guidelines would be automatically detectable while others wouldn’t. Likewise, we assumed that some guidelines would be repairable while others wouldn’t. Finally, a guideline that was repairable but not detectable would be assigned a High cost on the grounds that it was not worthwhile to repair code if we could not detect whether or not it complied with a guideline.
We also reasoned that the questions of detectability and repairability should be considered in theory. That is, is a satisfactory detection or repair heuristic possible? When considering if such a heuristic exists, you can ignore whether a commercial or open source product claims to implement the heuristic.
Today, the situation has changed, and therefore we need to update our definitions of detectable and repairable.
Detectability
A recent major change has been to add an Automated Detection section to every CERT guideline. This section identifies the analysis tools that claim to detect – and repair – violations of the guideline. For example, Parasoft claims to detect violations of every rule and recommendation in the SEI CERT C Coding Standard. If a guideline's Remediation Cost is High, indicating that the guideline is non-detectable, is that inconsistent with all of the tools listed in its Automated Detection section?
The answer is that the tools listed for such a guideline may be subject to false positives (that is, alerts on code that actually complies with the guideline), false negatives (that is, failing to report some truly noncompliant code), or both. It is easy to construct an analyzer with no false positives (simply never report any alerts) or no false negatives (simply report every line of code as noncompliant). But for many guidelines, detection with no false positives and no false negatives is, in theory, undecidable. Some properties are easier to analyze than others, but in general practical analyses are approximate, suffering from false positives, false negatives, or both. (A sound analysis is one that has no false negatives, though it might have false positives. Most practical tools, however, have both false negatives and false positives.) For example, EXP34-C, the C rule that forbids dereferencing null pointers, is not automatically detectable by this stricter definition. As a counterexample, violations of rule EXP45-C (do not perform assignments in selection statements) can be detected reliably.
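To make the contrast concrete, consider this small, invented example: the EXP45-C violation below is apparent from the syntax alone, while deciding whether the EXP34-C dereference can actually occur with a null pointer requires reasoning about every caller.

```c
#include <stdio.h>

/* EXP45-C: the assignment inside the if condition is visible in the
 * syntax alone, so a tool can flag it with essentially no false
 * positives or false negatives. */
void exp45_example(int x, int y) {
    if (x = y) {        /* almost certainly meant x == y */
        puts("equal");
    }
}

/* EXP34-C: whether p can actually be NULL at the dereference depends
 * on how every caller relates p to have_data, which a tool cannot
 * decide in general; an alert here may be a false positive, and a
 * genuine violation elsewhere may be missed. */
void exp34_example(int *p, int have_data) {
    if (have_data) {
        printf("%d\n", *p);
    }
}

int main(void) {
    int v = 42;
    exp45_example(1, 2);
    exp34_example(&v, 1);
    return 0;
}
```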
A suitable definition of detectable is: can a static analysis tool determine whether code violates the guideline with both a low false positive rate and a low false negative rate? We do not require that there never be false positives or false negatives, but we do require that both rates be small, meaning that a tool's alerts are complete and accurate for practical purposes.
Most guidelines, including EXP34-C, will, by this definition, be undetectable using the current crop of tools. This does not mean that tools cannot report violations of EXP34-C; it just means that any such violation might be a false positive, the tool might miss some violations, or both.
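For concreteness, here is a sketch of how these two rates might be estimated from a sample of audited alerts; the rate definitions are the conventional ones for static analysis tools, and the counts in main() are invented.

```c
#include <stdio.h>

/* False positive rate: the fraction of a tool's alerts that do not
 * correspond to genuine violations.
 * False negative rate: the fraction of genuine violations (found by a
 * manual audit of the sample) that the tool failed to report. */
static double fp_rate(int true_alerts, int false_alerts) {
    return (double)false_alerts / (true_alerts + false_alerts);
}

static double fn_rate(int true_alerts, int missed_violations) {
    return (double)missed_violations / (true_alerts + missed_violations);
}

int main(void) {
    /* Invented numbers for illustration only. */
    printf("FP rate: %.2f\n", fp_rate(90, 10));   /* 0.10 */
    printf("FN rate: %.2f\n", fn_rate(90, 30));   /* 0.25 */
    return 0;
}
```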
Repairability
Our notion of what is repairable has been shaped by recent advances in Automated Program Repair (APR) research and technology, such as the Redemption project. Specifically, the Redemption project and tool consider a static analysis alert repairable regardless of whether it is a false positive. Repairing a false positive should, in theory, not alter the code's behavior. Furthermore, in Redemption, a single repair should be restricted to a local region and not distributed throughout the code. For example, changing the number or types of a function's parameters requires modifying every call to that function, and those calls can be distributed throughout the code; such a change would therefore not be local.
With that said, our definition of repairable can be expressed as: code is repairable if an alert can be reliably fixed by an APR tool, and the only modifications to the code are near the site of the alert. Furthermore, repairing a false positive alert must not break the code. For example, the null-pointer-dereference rule (EXP34-C) is repairable because a pointer dereference can be preceded by an automatically inserted null check. In contrast, CERT rule MEM31-C requires that all dynamic memory be freed exactly once. An alert that complains that some pointer goes out of scope without being freed seems repairable by inserting a call to free(pointer). However, if the alert is a false positive, and the pointer's pointed-to memory was already freed, then the APR tool may have just created a double-free vulnerability, in essence converting working code into vulnerable code. Therefore, rule MEM31-C is not, with current capabilities, automatically repairable.
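The following sketch illustrates the difference; the inserted lines are our own illustration, not the actual output of Redemption or any other APR tool.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* EXP34-C repair: a null check inserted just before the dereference is
 * local, and it is harmless even when the alert was a false positive,
 * because a pointer that is never NULL simply never takes the branch. */
void print_length(const char *s) {
    if (s == NULL) {        /* inserted by the repair */
        return;
    }
    printf("%zu\n", strlen(s));
}

/* MEM31-C "repair": inserting free(p) where p goes out of scope is not
 * safe in general.  If the alert was a false positive because the
 * memory is freed elsewhere, the inserted call creates a double-free. */
void use_buffer(char *p) {
    /* ... use p ... */
    free(p);                /* inserted by a naive repair */
}

int main(void) {
    print_length("hello");
    print_length(NULL);     /* guarded by the inserted check */

    char *buf = malloc(16);
    use_buffer(buf);        /* safe here only because main() does not
                             * also free buf */
    return 0;
}
```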
The New Remediation Cost
While the previous Remediation Cost metric treated detectability and repairability as interrelated, we now believe they are independent and interesting metrics in their own right. Under the old metric, a rule that was neither detectable nor repairable received the same Remediation Cost as one that was repairable but not detectable; we now believe that difference should be reflected in our metrics. We are therefore considering replacing the old Remediation Cost metric with two metrics: Detectable and Repairable. Both metrics are simple yes/no questions.
There is still the question of how to generate the Priority metric. As noted above, this was the product of the Remediation Cost, expressed as an integer from 1 to 3, with two other integers from 1 to 3. We can therefore derive a new Remediation Cost metric from the Detectable and Repairable metrics. The most obvious solution would be to assign a 2 to each yes and a 1 to each no and take the product. This yields a metric similar to the old Remediation Cost, as shown in the following table:
| Is Automatically... | Not Repairable | Repairable |
|---------------------|----------------|------------|
| Not Detectable | 1 | 2 |
| Detectable | 2 | 4 |
However, we decided that a value of 4 is problematic. The old Remediation Cost metric had a maximum of 3; a maximum of 4 would skew the product, raising the highest possible Priority to 3*3*4=36 instead of 27, and it would make the new Remediation Cost weigh more heavily than the other two metrics. Replacing the 4 with a 3 solves these problems:
| Is Automatically... | Not Repairable | Repairable |
|---------------------|----------------|------------|
| Not Detectable | 1 | 2 |
| Detectable | 2 | 3 |
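As a quick sketch of this mapping (the function name is ours and purely illustrative), the new Remediation Cost could be computed as follows:

```c
#include <stdbool.h>
#include <stdio.h>

/* Derive the new Remediation Cost from the two yes/no metrics:
 * score each "yes" as 2 and each "no" as 1, take the product,
 * and cap the result at 3 so the metric stays in the 1..3 range. */
static int new_remediation_cost(bool detectable, bool repairable) {
    int product = (detectable ? 2 : 1) * (repairable ? 2 : 1);
    return product > 3 ? 3 : product;
}

int main(void) {
    /* Not detectable but repairable (e.g., INT30-C below) yields 2. */
    printf("%d\n", new_remediation_cost(false, true));
    /* Detectable and repairable yields 4, capped to 3. */
    printf("%d\n", new_remediation_cost(true, true));
    return 0;
}
```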
Next Steps
Next will come the task of examining each guideline to replace its Remediation Cost with new Detectable and Repairable metrics. We must also update the Priority and Level metrics for guidelines where the Detectable and Repairable metrics disagree with the old Remediation Cost.
Tools and processes that incorporate the CERT metrics will need to be updated to reflect CERT's new Detectable and Repairable metrics. For example, CERT's own SCALe project provides software security audits ranked by Priority, and future rankings of the CERT C rules will change.
Here are the old and new metrics for the C Integer Rules:
| Rule | Detectable | Repairable | New REM | Old REM | Title |
|------|------------|------------|---------|---------|-------|
| INT30-C | No | Yes | 2 | 3 | Ensure that unsigned integer operations do not wrap |
| INT31-C | No | Yes | 2 | 3 | Ensure that integer conversions do not result in lost or misinterpreted data |
| INT32-C | No | Yes | 2 | 3 | Ensure that operations on signed integers do not result in overflow |
| INT33-C | No | Yes | 2 | 2 | Ensure that division and remainder operations do not result in divide-by-zero errors |
| INT34-C | No | Yes | 2 | 2 | Do not shift an expression by a negative number of bits or by greater than or equal to the number of bits that exist in the operand |
| INT35-C | No | No | 1 | 2 | Use correct integer precisions |
| INT36-C | Yes | No | 2 | 3 | Converting a pointer to integer or integer to pointer |
In this table, New REM (Remediation Cost) is the metric we would produce from the Detectable and Repairable metrics, and Old REM is the current Remediation Cost metric. Clearly, only INT33-C and INT34-C have the same New REM values as Old REM values. This means that their Priority and Level metrics remain unchanged, but the other rules would have revised Priority and Level metrics.
Once we have computed the new Risk Assessment metrics for the CERT C Secure Coding Rules, we would next handle the C recommendations, which also have Risk Assessment metrics. We would then proceed to update these metrics for the remaining CERT standards: C++, Java, Android, and Perl.
Auditing
The new Detectable and Repairable metrics also alter how source code security audits should be conducted.
Any alert for a guideline that is automatically repairable need not be audited at all; it could be immediately repaired. If an automated repair tool is not available, it could instead be repaired manually by developers, who need not care whether it is a true positive. An organization may choose to apply all of the potential repairs outright or to review them first; the extra review effort is only warranted to the extent required by its software quality standards and its degree of trust in the APR tool.
Likewise, any alert for a guideline that is automatically detectable need not be audited. It should be repaired automatically with an APR tool (if the guideline is also repairable) or sent to the developers for manual repair.
This raises a question: detectable guidelines should, in theory, almost never yield false positives, but is this actually true in practice? An alert might still be false due to bugs in the static analysis tool or in the mapping between the tool and the CERT guideline. We could conduct a series of source code audits to confirm that a guideline truly is automatically detectable and revise the classification of any guideline that turns out not to be.
Only guidelines that are neither automatically detectable nor automatically repairable should actually be manually audited.
Given the huge number of static analysis (SA) alerts generated by most code in the DoD, any optimization of the auditing process should result in more alerts being audited and repaired. The process lessens the effort required to address each alert; however, many organizations do not address all alerts and consequently accept the risk of unresolved vulnerabilities in their code. So instead of reducing total effort, this improved process reduces risk.
This improved process can be summed up by the following pseudocode:
- For each alert:
  - If alert is repairable:
    - If we have an APR tool to repair alert:
      - Use APR tool to repair alert
    - Else (no APR tool):
      - Send alert to developers for manual repair
  - Else (alert is not repairable):
    - If alert is detectable:
      - Send alert to developers for manual repair
    - Else (alert is not detectable):
      - Send alert to auditors
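The same triage could be sketched in C as follows; the struct, enum, and function names are our own illustration, not part of SCALe, Redemption, or any other tool.

```c
#include <stdbool.h>

/* Properties of an alert, derived from the guideline it maps to and
 * from the tooling available for that guideline. */
struct alert {
    bool repairable;     /* guideline is automatically repairable */
    bool detectable;     /* guideline is automatically detectable */
    bool have_apr_tool;  /* an APR tool covers this guideline */
};

enum disposition { AUTO_REPAIR, MANUAL_REPAIR, AUDIT };

/* Triage one alert according to the process above. */
enum disposition triage(const struct alert *a) {
    if (a->repairable) {
        return a->have_apr_tool ? AUTO_REPAIR : MANUAL_REPAIR;
    }
    if (a->detectable) {
        return MANUAL_REPAIR;   /* trust the alert; skip the audit */
    }
    return AUDIT;               /* neither detectable nor repairable */
}

int main(void) {
    struct alert a = { .repairable = false, .detectable = false,
                       .have_apr_tool = false };
    return triage(&a) == AUDIT ? 0 : 1;   /* this alert goes to auditors */
}
```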
Your Feedback Needed
We are publishing this specific plan to solicit feedback. Would these changes to our Risk Assessment metric disrupt your work? How much effort would they impose on you? If you would like to comment, please send an email to info@sei.cmu.edu.