Don't Play Developer Testing Roulette: How to Use Test Coverage
Suppose someone asked you to play Russian Roulette. Although your odds of surviving are 5 to 1 (83 percent), it is hard to imagine how anyone would take that risk. But taking comparable risk owing to incomplete software testing is a common practice. Releasing systems whose tests achieve only partial code coverage—the percentage of certain elements of a software item that have been exercised during its testing—is like spinning the barrel and hoping for the best, or worse, believing there is no risk. This post is partly a response to questions I'm frequently asked when working with development teams looking for a definitive answer to what adequate testing means: Is code coverage really useful? Is 80 percent code coverage enough? Does it even matter?
A software testing coverage report identifies untested code and, more importantly, is essential for designing compact and effective test suites. Coverage should never be used as the primary criterion of test completeness, but it should always be checked to reveal test design misunderstandings and omissions. Although this idea is neither new nor controversial—see Brian Marick's "How to Misuse Code Coverage," published in the 1999 proceedings of the 16th International Conference on Testing Computer Software—the extent to which it is unknown, misunderstood, or ignored continues to surprise me. This post, the first of two, explains how code coverage is computed, what it means, and why partial coverage is an unnecessary risk. In the second post, I offer a definition of done that uses adequate coverage and six best practices to routinely achieve high consistency and effectiveness in developer-conducted testing. Together, they explain in practical terms how to achieve an effective software testing strategy using coverage. Links are provided to results from research and practice that support key elements of the case for testing coverage.
What is Code Coverage?
Code coverage is the percent of certain elements of a software item that have been exercised during its testing. There are many ideas about which code elements are important to test and therefore many kinds of code coverage. Some are code-based (white-box) and some are behavior-based (black-box). Open source and commercial coverage tools are available for all popular and many specialized programming languages. A coverage tool typically adds trace statements to the software item under test (SIUT) before it is tested. This instrumented SIUT is built and run by a suite of test cases, producing a trace of the elements executed. The coverage tool then analyzes this trace to report which elements were executed. A coverage report is specific to both the tests used and the tested version of the software item. The instrumented code is typically discarded.
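To make the instrumentation idea concrete, here is a minimal sketch of how executed lines can be recorded and turned into a coverage percentage. It uses Python's built-in sys.settrace hook rather than source rewriting, and the function name, line offsets, and reporting are my own toy choices, not how a production coverage tool is built:

```python
import sys

executed = set()

def tracer(frame, event, arg):
    # The core idea of a coverage tool: record every line the SIUT executes.
    if event == "line" and frame.f_code.co_name == "classify":
        # Store line offsets relative to the function's first line.
        executed.add(frame.f_lineno - frame.f_code.co_firstlineno)
    return tracer

def classify(n):               # the toy SIUT
    if n < 0:                  # offset 1
        return "negative"      # offset 2 -- never runs for n >= 0
    return "non-negative"      # offset 3

sys.settrace(tracer)
classify(5)                    # one test case: covers offsets 1 and 3 only
sys.settrace(None)

total = {1, 2, 3}              # executable line offsets in classify
coverage = 100 * len(executed & total) / len(total)
print(f"statement coverage: {coverage:.0f}%")
```

The trace shows that the "negative" branch never ran, so the single test achieves only two-thirds statement coverage of the toy function.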
For example, if a certain test suite causes 400 out of 500 SIUT source code statements to execute at least once, we say this test suite achieves 80 percent statement coverage. A test suite that causes each code block contingent on a conditional expression to be executed at least once is said to achieve decision or branch coverage. For certification of aircraft software systems, the FAA requires modified condition decision coverage (MCDC), an advanced form of decision coverage.
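A toy example can show why decision (branch) coverage is stronger than statement coverage. The function and its seeded bug below are my own invention for illustration, not drawn from any real system:

```python
def max_of(a, b):
    """Return the larger of a and b. Contains a seeded bug."""
    result = b              # bug: should be "result = a"
    if b > a:
        result = b
    return result

# This single test executes every statement at least once, so it
# achieves 100 percent statement coverage -- and it passes:
assert max_of(1, 5) == 5

# Decision coverage also demands a test that takes the false branch of
# "if b > a"; only such a test exposes the seeded bug:
assert max_of(5, 1) == 1    # buggy result: the larger value is 5
```

Statement coverage is satisfied by the first test alone, yet the bug survives; the branch-coverage test is what reveals it.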
Coverage is never the number of tests run or the percentage of tests run that pass or fail. When coverage is used without qualification, statement coverage of code is usually assumed. When used without giving a percentage, 100 percent is usually assumed. Measuring code coverage is useful for developer testing, but much less so for integration or system scope testing.
Why Should We Care about Code Coverage?
Designing practical and effective software tests is a fascinating and often frustrating puzzle. Think of code coverage as a checklist for places to look for bugs. Just like looking for your misplaced keys, you'll probably try (usually without success) the most obvious places first: the pockets of the coat you last wore, the kitchen counter, etc. You wouldn't skip those places, but neither would you conclude your keys are irreplaceably lost if they are not there.
We need focus when we test software because truly exhaustive software testing of any kind is impossible. There are an astronomically large number of execution conditions where bugs and vulnerabilities can hide, so exhaustive testing would require an astronomically large number of tests. Even very extensive test suites can reach only a very tiny subset of all execution conditions. Moreover, unless SIUT code is truly atrocious, a very high proportion of its execution conditions will perform correctly, even when bugs are present. Simply executing a buggy statement is not sufficient. The data it uses must be such that the bug is triggered and then produces an observable failure. The Y2K software pandemic is just one example of the interplay of these criteria. When code that had worked without trouble for decades tried to process a century-less date of the new millennium, it would crash and/or produce incorrect results. So why, you may ask, should we give code coverage any credit?
That's because we have zero chance of revealing bugs in code that isn't tested. Some might ask, But doesn't testing exercise a code unit as a whole? Yes, but it is very easy to run lots of tests and still not exercise all the elements of a code unit. Recall that 100 percent statement coverage means that every line of code has been executed in a test run at least once. It is, however, the barest minimum adequacy criterion--the dirt floor of white-box testing strategies--because a buggy statement that is never executed leaves no chance of revealing its bug. This is why a test suite must at least execute (cover) every statement to have a slim chance (but not a guarantee) of revealing latent bugs.
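The Y2K example above reduces to a toy function that shows why coverage offers only a slim chance, not a guarantee: executing a buggy statement is not enough, because the data must also trigger the fault. The function and values here are my own illustration:

```python
def age_in_years(birth_yy, current_yy):
    # Y2K-style bug: years are stored as two digits.
    return current_yy - birth_yy

# The buggy statement executes and the result is still correct --
# this input data never triggers the fault:
assert age_in_years(60, 99) == 39      # 1960 to 1999: fine for decades

# The very same statement fails once the data crosses the century:
print(age_in_years(60, 1))             # 1960 to 2001: -59, not 41
```

A test suite could cover this statement a thousand times with pre-2000 dates and never observe a failure.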
The essential task of test design is to wisely select from a practically infinite number of possible tests, knowing that bugs are present but not knowing exactly which test cases will reveal them. To choose wisely, therefore, test design tries to identify test cases that have a good chance of revealing bugs within available time and resources. Coverage helps to focus and make sure we have tried certain code elements at least once. Even though exhaustive code-based testing cannot reveal the omission of a necessary feature, testing evaluated with coverage can often lead to insights that reveal omissions.
The coverage reports from a comprehensive test suite can also reveal vulnerabilities and malicious code. Uncovered code may indicate an intentionally created logic bomb or "Easter egg" exploit. This situation is most likely to occur with software of unknown provenance (SOUP). Likewise, uncovered code can reveal dead or unused code that can be overwritten with malicious code, a vulnerability that should be addressed.
Code Coverage Roulette
Many developers see 100 percent coverage (of any kind) as impractical and therefore claim that some lesser number is acceptable. I've heard and seen partial coverage thresholds touted many times over 30 years: 70 percent, 80 percent, or 85 percent. The exact number is not important. What is important is that a partial coverage threshold provides a ready excuse for arbitrary and weak testing. The justification for a partial coverage threshold usually goes something like this:
Execution of code elements often depends on conditions that the SIUT does not control, such as runtime exceptions, messages from external systems, or shared data state. If the conditions that cause such a blockage cannot be achieved using the SIUT's public interface, the blocked code cannot easily be executed, and hence cannot be covered. While it may be possible to get blocked code to execute using additional test code, stubs, mocks, or drivers, the incremental coverage is not worth the additional work.
Therefore, many developers are adamant that requiring 100 percent statement coverage for all test suites is a goal that only Dilbert's pointy-haired boss would insist on. Thus 80-ish percent statement coverage is touted as a practical compromise.
I have never seen any evidence to support a particular partial coverage threshold. I have, however, seen many cases where such a threshold has become the acceptable test completion criterion, regardless of whether there are actual technical obstacles or not. Using a partial coverage threshold as a criterion for test completeness usually results in a superficial exercise of easily reached elements, without exercising the interaction of covered and blocked code. Worse, it is an open invitation to scrimp on testing even when there is no blockage.
What can we do about blockages? When a blockage results from a hard-to-control dependency, it is good evidence of a latent bug or a code smell (i.e., dead code, hard-coded "fixes," poor assumptions, lack of robustness) that would benefit from refactoring. For benign blockages, readily available tools for test mocking can achieve conditions that trigger exceptions, return values from third-party libraries, or stand-in for unavailable software or hardware (See, for example, this exchange on StackOverflow.). Certain features of object-oriented languages can also obstruct coverage. For example, methods of abstract base classes and private methods cannot be directly invoked. Test design patterns to address this kind of blockage are well-established.
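As a sketch of the mocking approach to benign blockages, the snippet below uses Python's standard unittest.mock. The function load_timeout and its URL are hypothetical; the point is that the exception branch, which is blocked in normal runs because it requires a real network failure, becomes reachable and coverable once the dependency is patched:

```python
from unittest import mock
import urllib.request

def load_timeout(url):
    """Hypothetical SIUT: read a timeout value from a remote service."""
    try:
        with urllib.request.urlopen(url) as resp:
            return int(resp.read())
    except OSError:
        return 30   # blocked branch: fall back to a default timeout

# Patching urlopen to raise lets the test drive the except branch
# without any real network outage:
with mock.patch("urllib.request.urlopen", side_effect=OSError("down")):
    assert load_timeout("http://example.invalid/timeout") == 30
```

The same pattern applies to third-party library return values and unavailable hardware: substitute the uncontrollable dependency so the blocked code executes under test control.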
Hard blockages that cannot be resolved do occur, especially in legacy systems. Code that cannot be tested owing to a true blockage is arguably more likely to be buggy because the SIUT does not completely control the behavior of the blocked code. In this case, skipping verification of blocked code is playing coverage roulette. So, while you may not be able to test it, you certainly should verify it by other means. Team inspection of the blocked code is often simple and effective. Pay special attention to imagining all the conditions under which it may be executed and how it will respond. This analysis is like pondering a chess move--consider how the runtime environment might respond to each execution alternative.
So, now you know why accepting partial coverage is playing code coverage roulette. In the second part of this post, I'll describe specific testing practices that result in full coverage and consistently effective testing.