A Behind the Scenes View of Supporting ML-based Security Solutions
Presentation
Publisher
Software Engineering Institute
Abstract
Machine learning solutions for security use cases have gained widespread support among researchers, security vendors, and practitioners. The promise of automation has made these solutions particularly appealing as a way to maintain visibility into the threat landscape with limited resources. But that promise may have given many leaders the false impression that maintaining a production ML solution requires no manual effort, when in many cases the required effort is quite tedious. While ML researchers acknowledge the importance of these manual efforts, the tasks are often relegated to second-class status and not properly resourced.
In this talk, we make the importance of human intervention explicit by demonstrating the impact of label cleaning and false positive triaging on a production ML-based network threat solution. Specifically, we show how efforts to prematurely automate these tasks led to degraded efficacy, and we demonstrate the effects of neglecting model maintenance over time, maintenance that is critical for security solutions given the domain's inherent non-stationarity. We provide many real-world examples of corner cases and perform a rigorous comparison between models with low and high levels of manual and automated intervention. Finally, we offer additional support for human-in-the-loop systems that leverage automation and simple statistical analyses to provide context to domain experts, making these tasks far more efficient. For instance, we have developed an interactive labeler that takes a process name, parent process name, process hash, and the Internet destinations visited by that process as input and asks the domain expert to provide a class label.
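As a rough sketch only, the core record-and-prompt loop of such a labeler might look like the Python below. The four record fields mirror the inputs named above; the type and function names (ProcessRecord, label_interactively) are hypothetical and do not come from the actual tool.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProcessRecord:
    # The four inputs the labeler consumes, per the description above.
    process_name: str
    parent_process_name: str
    process_hash: str
    destinations: list[str] = field(default_factory=list)

def label_interactively(record: ProcessRecord, labels: dict[str, str]) -> Optional[str]:
    # Present the raw features and ask the domain expert for a class label.
    print(f"process: {record.process_name} (parent: {record.parent_process_name})")
    print(f"hash:    {record.process_hash}")
    print(f"dests:   {', '.join(record.destinations[:5])}")
    answer = input("label (leave blank to defer): ").strip()
    if answer:
        labels[record.process_name] = answer
        return answer
    return None  # expert deferred; revisit once more data is available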
Purely automated systems have failed for a variety of reasons: process names can be easily changed, process hashes are often missing from common threat intelligence sources like VirusTotal, and Internet destinations are often seen with many unrelated processes. The interactive labeler presents the raw features along with contextual data, such as any available threat intelligence, and a set of similar, previously labeled processes derived from several views, such as the edit distance between process names and a classifier that uses the Internet destinations as features. The domain expert can then select a suggested label, create a new label, or defer because there is not yet enough data to decide. While this approach was straightforward, it solved a key issue of dataset fidelity and allowed us to build, in a matter of weeks and with a single expert's part-time effort, a set of ~2,000 labels covering 99.99% of the 3+ billion Internet connections we collected with endpoint ground truth.
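A minimal sketch of the two suggestion views might look as follows, assuming difflib's similarity ratio as a stand-in for the edit-distance comparison and Jaccard overlap of destination sets as a stand-in for the destination-based classifier; the abstract does not name the actual algorithms, so both choices are assumptions.

import difflib

def similar_by_name(name: str, labeled: dict[str, str], k: int = 3) -> list[tuple[str, str]]:
    # Rank previously labeled processes by name similarity
    # (difflib's ratio serves as a proxy for edit distance).
    ranked = sorted(
        labeled,
        key=lambda other: difflib.SequenceMatcher(None, name, other).ratio(),
        reverse=True,
    )
    return [(other, labeled[other]) for other in ranked[:k]]

def similar_by_destinations(dests: list[str], labeled_dests: dict[str, set[str]],
                            labels: dict[str, str], k: int = 3) -> list[tuple[str, str]]:
    # Rank previously labeled processes by Jaccard overlap of the
    # Internet destinations they were observed contacting (a stand-in
    # for the destination-based classifier view).
    query = set(dests)
    def overlap(other: str) -> float:
        seen = labeled_dests[other]
        return len(query & seen) / max(len(query | seen), 1)
    ranked = sorted(labeled_dests, key=overlap, reverse=True)
    return [(other, labels[other]) for other in ranked[:k]]

The expert then sees the top-k suggestions from each view alongside the raw features and contextual data, and either adopts a suggested label, creates a new one, or defers.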
Attendees Will Learn
This presentation will provide a view into the manual work required to maintain an ML solution in the security space, with the aim of informing leaders about true maintenance needs. We will also examine several automation pitfalls that can derail the performance of an ML security project.