Data-Driven Detection Using PySpark
Software Engineering Institute
This presentation was given at FloCon 2023, an annual conference that focuses on applying any and all collected data to defend enterprise networks.
Organizations with large log volumes, custom log types, many different logs, or special analysis needs have moved beyond the capabilities of SIEMs that offer packaged analysis or special-purpose languages focused primarily on search and matching use cases. At the same time, machine learning expertise is becoming deeper and broader, particularly in large organizations with dedicated data engineering and data science teams. These trends are driving security organizations toward data-driven analysis, which requires highly flexible interfaces such as Python notebooks.
Our Detection Engineering team has been using our PySpark platform to build streaming pipelines that cover not only basic rule-based use cases but also full ML models registered with MLflow. In this talk, we'll discuss the underlying Python framework that we built for our own operational needs and are releasing to the public.
This is an applied talk based on our own experience building an operational detection system that processes multiple terabytes per day across more than 30 pipelines. We believe others will benefit from our example and from the code that we're releasing before or in concert with this talk.