Test Suites as a Source of Training Data for Static Analysis Classifiers

This video by Lori Flynn was recorded as part of the ACM/IEEE International Conference on Automation of Software Test (AST 2021), co-located with ICSE.

Software Engineering Institute

Flaw-finding static analysis tools typically generate large volumes of code flaw alerts including many false positives. To save on human effort to triage these alerts, a significant body of work attempts to use machine learning to classify and prioritize alerts. Identifying a useful set of training data, however, remains a fundamental challenge in developing such classifiers in many contexts. We propose using static analysis test suites (i.e., repositories of benchmark programs that are purpose-built to test static analysis tools) as a novel source of training data. Specifically, we generated a large quantity of alerts by executing various static analyzers on the Juliet test suite, and we automatically derived ground truth labels for these alerts by referencing the Juliet test suite metadata. Finally, we used this data to train classifiers to predict whether an alert is a false positive. Our classifiers obtained high precision and recall for a large number of code flaw types on a hold-out test set. This preliminary result suggests that pre-training classifiers on test suite data could help to jumpstart static analysis alert classification in data-limited contexts.
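
The labeling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Alert` record, the `KNOWN_FLAWS` set, the file names, and the matching rule (same file, line, and CWE) are all hypothetical stand-ins for whatever the test suite metadata actually provides. The sketch only shows the general idea of deriving a label for each alert from known flaw locations and then scoring a classifier's predictions with precision and recall.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Alert:
    """One static analyzer alert (hypothetical record layout)."""
    tool: str
    path: str
    line: int
    cwe: int


# Hypothetical stand-in for test suite metadata: known flaw locations
# as (path, line, cwe). Real metadata may be structured differently.
KNOWN_FLAWS: Set[Tuple[str, int, int]] = {
    ("CWE190_example_01.c", 30, 190),
    ("CWE476_example_01.c", 42, 476),
}


def label_alert(alert: Alert, known_flaws: Set[Tuple[str, int, int]]) -> Optional[bool]:
    """Return True if the alert is a false positive, False if it is a
    true positive, or None if the metadata cannot decide.

    Assumed rule: an alert matching a documented flaw exactly is a true
    positive; an alert for a CWE that the metadata documents in the same
    file, but at another line, is a false positive; anything else is
    left unlabeled.
    """
    if (alert.path, alert.line, alert.cwe) in known_flaws:
        return False
    if any(p == alert.path and c == alert.cwe for p, _, c in known_flaws):
        return True
    return None


def precision_recall(predicted: Iterable[bool], actual: Iterable[bool]) -> Tuple[float, float]:
    """Precision and recall for the positive class (here: 'false positive')."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    alerts = [
        Alert("toolA", "CWE190_example_01.c", 30, 190),  # matches a known flaw
        Alert("toolA", "CWE190_example_01.c", 55, 190),  # same file and CWE, other line
        Alert("toolB", "CWE190_example_01.c", 30, 476),  # CWE not documented for this file
    ]
    for alert in alerts:
        print(alert, "->", label_alert(alert, KNOWN_FLAWS))
```

Leaving undecidable alerts unlabeled, rather than forcing them into one class, is a design choice in this sketch: it keeps uncertain cases out of the training data at the cost of a smaller labeled set.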

To learn more, visit the AST 2021 website, which hosts the full live-streamed conference video and the paper preprint.