Monitoring Massive Network Traffic using Bayesian Inference

January 9, 2019 • Presentation

By

David Rodriguez (Cisco Systems, Inc.)

In this presentation, the author discusses methods for performing large scale Bayesian inference on DNS logs aggregated into count data, representing the number of requests from tens of millions of stub IPs made to hundreds of millions of domains.

Publisher

Cisco Systems, Inc.

Subjects

Flocon

Abstract

Monitoring network logs from DNS requests to TCP connections is challenging because these logs are both large and noisy- hindering efforts to identify malicious traffic. In a sizable network, for example, it is common to see thousands of requests made to one destination- at one time the frequency is cyclical and at another sporadic. This random behavior in network connections causes most unsupervised and supervised statistical modeling to fail. In this talk we discuss methods for performing large scale Bayesian inference on DNS logs aggregated into count data, representing the number of requests from tens of millions of stub IPs made to hundreds of millions of domains. We describe novel mixtures of common discrete distributions, or hidden Markov processes, that model some of the most sporadic network traffic volumes to domain names. For example, we discuss how the zero inflated Poisson (ZIP) and zero inflated negative binomial (ZINB) distributions, and their more generalized forms, provide parameters we can use to differentiate traffic volumes associated with day-to-day threats from spam and malvertising to widespread threats arising from botnets. Using Apache Spark and Stripe’s newly released Rainier - a powerful Bayesian inference software for the JVM - we run tens of thousands of simulations per domain, fitting the underlying distribution of requests, then repeating this for millions of domains. We profile the performance by fitting a variety of mixtures of distributions to different sporadic traffic volumes. Running simulations often, we then show how to efficiently trend parameter estimates using exponential moving averages to model day/night and weekday/weekend traffic distributions. With hundreds of thousands of simulated and archived traffic patterns associated with benign and malicious network traffic, we show how to reduce false alarms to effectively monitor evolving online threats and masquerading malicious traffic.

Software Engineering Institute