search menu icon-carat-right cmu-wordmark

Performance of Compiler-Assisted Memory Safety Checking

Headshot of David Keaton

According to a 2013 report examining 25 years of vulnerabilities (from 1998 to 2012), buffer overflow causes 14 percent of software security vulnerabilities and 35 percent of critical vulnerabilities, making it the leading cause of software security vulnerabilities overall. As of July 2014, the TIOBE index indicates that the C programming language, which is the language most commonly associated with buffer overflows, is the most popular language with 17.1 percent of the market. Embedded systems, network stacks, networked applications, and high-performance computing rely heavily upon C. Embedded systems can be especially vulnerable to buffer overflows because many of them lack hardware memory management units. This blog post describes my research on the Secure Coding Initiative in the CERT Division of the Carnegie Mellon University Software Engineering Institute to create automated buffer overflow prevention.

For greenfield software development projects, which are projects developed without constraints from any previous system, it may be practical to mandate a language subset or annotations. Most development efforts, however, are extensions or modifications of existing code. It is therefore necessary to find a solution to eliminating buffer overflows that can work with legacy code.

The catalyst for this research occurred prior to my arrival at CERT, while I was consulting for a large company that deployed a 3-million-line code base. The same potential buffer overflow problem kept cropping up, so I launched a project at that company to remediate the problem.

We picked several thousand instances of the potential buffer overflow bug. We then assembled 10 engineers and, after dividing the work equally, went in and fixed the bugs manually. What we found was that while we were successful in preventing a lot of bugs associated with buffer overflows, we introduced more bugs than we would have introduced had we been writing the code from scratch. We knew this because that company kept very good records about the quality of their code at each stage.

Clearly, our approach was sound, but how we went about it was wrong. We needed an automated approach to buffer overflow protection to eliminate the potential for human error.

Foundations of an Automated Approach

Because C focuses on performance and being close to the hardware, bounds checks on array references and pointer dereferences are not part of the language. Sometimes a compiler can prove at compile time that an access is in bounds, but usually this is not the case. If the user desires bounds checking, then for the memory accesses that cannot be resolved at compile time, the compiler must insert additional instructions into the generated code to perform the checks. The time the processor takes to execute those extra instructions at run time is the performance overhead of a bounds checking mechanism.

I have found that many researchers report excellent low overheads for automated buffer overflow elimination schemes while in practical use they are actually substantially higher. For example, one set of researchers may report very low overheads by taking advantage of a trap on misaligned memory accesses, but many processors do not trap when dereferencing misaligned pointers. As a result, the check for misalignment must be performed by software, which causes a much higher overhead. Another set of researchers may use a shadow memory by dividing up the address space for the program and for the overflow checking. Again, this may not work in the practical domain because it requires operating system modifications.

I investigated what performance could be achieved in the practical realm and how that performance could be improved for deployment of an automated, compiler-based memory safety checking tool. Because the goal was to support legacy code, there were additional constraints:

  • any approach used could not require changes to the source code
  • application binary interface (ABI) compatibility should also be maintained because developers do not have control over the complete system on which the application will run
  • it should be possible to link checked code with unchecked binary libraries for which the source code might not be available

I began with two memory safety checkers that meet these criteria, SAFECode and SoftBound.

Methodology

I selected a small set of programs to use as a benchmark from the SPEC CPU 2006 suite because they are written in C and compile cleanly with Clang/LLVM 3.2, the latest version of the compiler for which both SAFECode and SoftBound were available. I measured the overhead with just the source code available on the Internet and then applied my own optimizations and measured it again.

As described in an August 2014 technical note I wrote with Robert Seacord, Performance of Compiler-Assisted Memory Safety Checking, I made three performance enhancements to SoftBound to investigate their effects on performance.

First, I hoisted spatial memory access checks out of loops when the loop bounds were known on entry. As an example, consider the following function:

#include <stddef.h>

void foo(int *a)
{
for (size_t i = 0; i < 100; ++i)
a[i] = i;
}

The figure below shows the generated LLVM code, including the effect of hoisting the check of the store to a[i] out of the loop. The optimization removes the struck-out text preceded by minus signs in the figure and inserts the bold text preceded by plus signs. The beginning of the check is adjusted to be the first element accessed (the beginning of array a), and the length of the check is adjusted to include all accesses that will be performed by the loop (400 bytes), rather than checking each iteration separately.

define void @foo(i32* nocapture %a) nounwind uwtable ssp {
entry:
%0 = tail call i8* @__softboundcets_load_base_shadow_
stack(i32 1) nounwind
%1 = tail call i8* @__softboundcets_load_bound_shadow_
stack(i32 1) nounwind
+ %bitcast = bitcast i32* %a to i8*
+ tail call void @__softboundcets_spatial_store_
dereference_check(i8* %0, i8* %1, i8* %bitcast, i64 400) nounwind

br label %for.body

for.body: ; preds
= %for.body, %entry
%i.04 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
%conv = trunc i64 %i.04 to i32
%arrayidx = getelementptr inbounds i32* %a, i64 %i.04
- %bitcast = bitcast i32* %arrayidx to i8*
- tail call void @__softboundcets_spatial_store_dereference_
check(i8* %0, i8* %1, i8* %bitcast, i64 4) nounwind
store i32 %conv, i32* %arrayidx, align 4, !tbaa !0
%inc = add i64 %i.04, 1
%exitcond = icmp eq i64 %inc, 100
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds
= %for.body
ret void
}

Figure 1: LLVM Code Showing the Effect of Hoisting a Bounds Check Out of a Loop

To prevent spurious error reports, with the check being executed and the offending access not executed, the check is hoisted only if it post-dominates the first basic block inside the loop. The check post-dominates the loop's first basic block if all possible execution paths from that basic block to a loop exit pass through the check.

I next hoisted bounds checks out of a function and into its callers when the compiler could see all calls to the function, so that a bounds check will be executed somewhere (if necessary) if it is deleted from its original function. To see how this might be beneficial, consider the following program:

#include <stdio.h>
#include <stdlib.h>

static void foo(unsigned char *a)
{
for (size_t i = 0; i < 100; ++i)
a[i] = i;
}

int main(void)
{
unsigned char a[100];

foo(a);
for (size_t i = 0; i < 100; ++i)
printf(" %d", a[i]);
putchar('\n');

return EXIT_SUCCESS;
}

If the bounds check for the access to a[i] is first hoisted outside the loop in foo and then up to main, it is now in a position where the size of array a is known. Because the length of the bounds check is also known, it can be compared against the size of a to determine if the bounds check can be eliminated entirely. This elimination occurs in the case of the preceding program. If, after hoisting the bounds check into the caller, there still is not enough information to eliminate the bounds check, it is performed within the caller in hopes that it can still be eliminated along other call paths to the original function.

The mechanism used to accomplish this optimization is to treat a bounds check in the called function as a precondition for that function (called a requirement by Plum and Keaton). In this case the precondition is that the memory space pointed to by a is at least 400 bytes long. Then all requirements are checked at their call sites to see whether the bounds checks can be eliminated for that call path or merely hoisted out of the called function.

Inlining can accomplish the same thing by effectively hoisting a bounds check into the calling function, where there might be enough information to eliminate the check. Therefore, this mechanism provides a benefit in cases where inlining is not performed, such as when the called function is large.

SoftBound is implemented so that it operates on optimized code. First the optimizations are run, then SoftBound is run, and then the optimizations are repeated in an attempt to improve any code added by SoftBound. I found that unrolling loops thwarted some attempts to hoist bounds checks. Fully unrolled loops contained a sequence of memory accesses in straight-line code in place of one access within a loop. I therefore disabled loop unrolling. Alternative approaches would have been to disable it only for the first pass of optimizations or to write an additional optimization to combine the adjacent bounds checks that result from unrolling loops.

The third change was to test the performance of bounds checks on stores only (to prevent arbitrary code execution), or on strings only (because incorrect string management is a leading cause of vulnerabilities), or only on stores to strings. Limiting the bounds checks in this way can provide some insight into the tradeoff between security and performance.

I also measured the performance of SAFECode and SoftBound with checking turned off, to discover the performance effect of the maintenance and propagation of metadata via the object table maintained by SAFECode or the pointer table maintained by SoftBound. To accomplish this, I disabled the portions of SAFECode that emit pointer arithmetic checks and bounds checks, and disabled the portions of SoftBound that emit bounds checks, leaving the bookkeeping code only and thereby establishing a ceiling for the performance benefit of bounds check optimizations.

Results

Overall, with the optimization of hoisting bounds checks out of loops, the average program ran 4.72 times slower with buffer overflow checking, a rate that was significantly slower than I had hoped. The programs experienced an average slowdown of 5.36 times without the optimization.

Next, as is also detailed in the technical note, I charted the slowdown for performing bounds checks only on stores, only on strings, and only on stores to strings, in addition to hoisting bounds checks out of loops and functions. With the optimization plus checking only stores, the average program ran 2.58 times slower than it did without buffer overflow checking. All three cases provided a performance benefit. The three are similar in magnitude, indicting that a useful tradeoff between performance and security may be achieved by checking only stores.

Next, I investigated the performance breakdown by examining the metadata propagation performed by SAFECode and SoftBound to gain more insight as to how the time is spent. There are two parts to run-time automatic buffer overflow elimination. One part involves tracking the bounds information that is needed to check for buffer overflows. This metadata must be available for the other part, the actual checking against the bounds to detect buffer overflows. In SAFECode, the actual checking occurs at pointer arithmetic and at loads and stores. In SoftBound, the actual checking occurs only at loads and stores.

When I turned off all of the checking code except for what was needed to pass the bounds data around inside the program, I found that the overhead required by metadata propagation by itself yielded an average overhead of 2.07, which is almost the entirety of the overhead. For SoftBound, which is the main codebase that I was working with for buffer overflow elimination, there would be a vast advantage to working on metadata propagation as opposed to the checks that occur at each store.

I compared this with SAFECode. SAFECode has much worse overhead overall. The average overhead, without any optimization, was 41.72 times slowdown with the buffer overflow checking, but with no optimization.

When I checked the overhead of SAFECode's metadata propagation only, the average slowdown was only 3.02, which is not much worse than SoftBounds' metadata propagation overhead, which was 2.07.

SAFECode's slowdown was 41.72, and its metadata slowdown was only 3.02, which clearly shows that in SAFECode the most benefit can be gained by increasing the speed at which checks occur at pointer arithmetic and at loads and stores.

The conclusion drawn from the results above is that each pointer overflow checking mechanism needs to be considered separately.

Challenges and Adjustments

In addition to hoisting bounds checks out of loops, I also tried to hoist them out of functions. I reasoned that if I hoisted a bounds check into the caller, there might be information in the calling function to reveal the size of the objects, eliminating the need for a bounds check.

Unfortunately, hoisting bounds checks out of functions did not provide a significant benefit. One reason is that this issue is already addressed by inlining. So, if I have a small function working on an array, it will often get inlined into the larger function and then see the object in the calling function at compile time.

Another challenge involved the physical structure of the LLVM compiler, which is divided into many different passes. LLVM will perform a pass through the code that it is compiling, and then it will perform another pass. There are function passes, which are performed once for every function in the program, and there are module passes, which are performed once for every module, which is a file (i.e., a collection of functions).

SoftBound, it turns out, is a module pass so that it can operate on multiple functions concurrently. In the way LLVM is structured, however, a module pass cannot use information generated by a function pass, which uses information generated by another module pass. This design decision allows for the parallelization of different passes, but negatively impacts constant folding. As a result, constant folding does not occur as intended. If the code contains the computation, 2+2, it will remain as a 2+2 because constant folding is a function pass that makes use of information from another module pass.

To mitigate the issue, I added the constant folding myself, manually, to make it work. Now, the 2+2 comes out as a 4, which makes optimization much easier.

Looking Ahead

There is hope that automated buffer overflow checking will one day perform fast enough to work in future performance-critical systems. After this study was finished, Intel announced a set of future hardware protections called "MPX" Memory Protection Extensions. Once Intel installs those hardware improvements, it is possible that we will be able to use compiler enforced buffer overflow elimination on performance-critical code as well.

I welcome feedback on this research in the comments section below.

Additional Resources

To read the technical note, Performance of Compiler-Assisted Memory Safety Checking, by David Keaton and Robert C. Seacord, please visit
https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=299175.

Get updates on our latest work.

Each week, our researchers write about the latest in software engineering, cybersecurity and artificial intelligence. Sign up to get the latest post sent to your inbox the day it's published.

Subscribe Get our RSS feed