Test Suites as a Source of Training Data for Static Analysis Classifiers
Software Engineering Institute
Flaw-finding static analysis tools typically generate large volumes of code flaw alerts including many false positives. To save on human effort to triage these alerts, a significant body of work attempts to use machine learning to classify and prioritize alerts. Identifying a useful set of training data, however, remains a fundamental challenge in developing such classifiers in many contexts. We propose using static analysis test suites (i.e., repositories of benchmark programs that are purpose-built to test static analysis tools) as a novel source of training data. Specifically, we generated a large quantity of alerts by executing various static analyzers on the Juliet test suite, and we automatically derived ground truth labels for these alerts by referencing the Juliet test suite metadata. Finally, we used this data to train classifiers to predict whether an alert is a false positive. Our classifiers obtained high precision and recall for a large number of code flaw types on a hold-out test set. This preliminary result suggests that pre-training classifiers on test suite data could help to jumpstart static analysis alert classification in data-limited contexts.