Posted by Network Situational Awareness

One of my responsibilities on the Situational Awareness Analysis team is to create analytics for various purposes. For the past few weeks, I've been working on some anomaly detection analytics for hunting in the network flow traffic of common network services. I decided to start with a very simple approach using mean and standard deviation for a historical period to create a profile that I could compare against current volumes. To do this, I planned on binning network traffic by some length of time to find time periods with anomalous volumes. The question I then had to answer was, "How should I define the historical period?" In this post, I explain the process I used to answer that question.
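As a minimal sketch of that comparison, the Python below scores a current bin against a historical profile. The function names, the use of the sample standard deviation, and the threshold of 3 standard deviations are illustrative assumptions, not details taken from the analytic itself.

```python
from statistics import mean, stdev


def z_score(history, current):
    """Return how many sample standard deviations `current` lies from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical bins")
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation (assumption)
    if sigma == 0:
        # A perfectly flat history: any departure from it is unbounded in this sketch.
        if current == mu:
            return 0.0
        return float("inf") if current > mu else float("-inf")
    return (current - mu) / sigma


def is_anomalous(history, current, threshold=3.0):
    """Flag the current bin if it is more than `threshold` deviations from the mean.

    The default threshold of 3.0 is an assumed value, not one from the post.
    """
    return abs(z_score(history, current)) > threshold
```

Here `history` is a list of per-bin volumes (bytes, packets, or flow counts) and `current` is the volume of the bin being evaluated.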

To define the historical period, I needed to answer several questions:

- What is the length of each time bin?
- How many bins do I need for my mean and standard deviation calculations to be representative of current expected behavior?
- What time bins should be used for the mean and standard deviation calculations?

The range of possible lengths for a time bin depends on the granularity with which your network flow collector records time, and could run from milliseconds to weeks, months, or even a year. However, the extremes of that range make little sense in practice. I decided to look at two lengths for time bins: an hour and a day.

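In general, shorter time bins are better suited to detecting short-duration anomalies of high magnitude, while longer time bins are better suited to detecting long-duration anomalies of low magnitude. A brief spike can be averaged away in a daily total, whereas a small but sustained increase may not stand out in any single hour.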
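To make the two bin lengths concrete, here is a rough Python sketch that sums flow volume into hourly or daily bins. The record layout, a `(start_time, byte_count)` tuple, and the function names are assumptions for illustration. A real flow collector or analysis suite will have its own record format and binning tools.

```python
from collections import defaultdict
from datetime import datetime


def bin_start(timestamp, bin_length):
    """Truncate a timestamp to the start of its hourly or daily bin."""
    if bin_length == "hour":
        return timestamp.replace(minute=0, second=0, microsecond=0)
    if bin_length == "day":
        return timestamp.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("bin_length must be 'hour' or 'day'")


def bin_volumes(records, bin_length):
    """Sum byte counts per time bin; returns {bin start: total bytes}."""
    totals = defaultdict(int)
    for start_time, byte_count in records:
        totals[bin_start(start_time, bin_length)] += byte_count
    return dict(totals)


# Hypothetical flow records: (flow start time, bytes)
records = [
    (datetime(2024, 1, 1, 10, 5), 100),
    (datetime(2024, 1, 1, 10, 55), 50),
    (datetime(2024, 1, 1, 11, 10), 25),
]
hourly = bin_volumes(records, "hour")
daily = bin_volumes(records, "day")
```

With these made-up records, the hourly view keeps the 10:00 and 11:00 totals separate, while the daily view collapses them into a single value for the day.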

To determine how many bins I needed for the mean and standard deviation calculations, I took three things into consideration:

- Many network services exhibit seasonality in their related flow traffic, often related to the work day. I did not want normal fluctuations for a work day or work week to look anomalous, so I needed to choose a history that was longer than the seasonal period so that the mean and standard deviation captured those normal variations.
- Networks evolve over time. The normal volume of traffic last year for a given service is probably very different from the normal volume today. This evolution results both from normal day-to-day activities, like gaining new users and losing old ones, and from major organizational and network architectural changes. Since I was focusing on simple analysis, I didn't want to worry about modeling change. So, I wanted to choose a history that was most likely to represent the current state of the network and not be influenced too much by previous states.
- I needed enough data points to get a reasonable mean and standard deviation.

The concept of seasonality made me realize that it may be possible to improve anomaly detection, even in a simple mean and standard deviation analytic, by thinking about how I chose the time bins for the history. I could choose a history made up of consecutive time bins, or I could choose time bins that correspond in some manner to the current time bin I wanted to evaluate.

For the corresponding method, I considered two options each for hourly and daily time bins. For hourly, I considered using the same hour of the day for some number of consecutive days and using the same hour for the same day of the week for some number of weeks. For daily, I considered using the same day of the week for some number of weeks, and the same day of the month for some number of months.
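The consecutive and corresponding options differ only in how far apart the historical bins are spaced, which the following Python sketch tries to show. The method names and the `history_bins` helper are hypothetical. The same-day-of-month option is left out because it needs calendar-aware arithmetic rather than a fixed step.

```python
from datetime import datetime, timedelta

# Spacing between historical bins for each selection method (names are illustrative).
STEPS = {
    "consecutive_hours": timedelta(hours=1),    # hourly bins, consecutive
    "same_hour_daily": timedelta(days=1),       # same hour on consecutive days
    "same_hour_weekly": timedelta(weeks=1),     # same hour, same day of the week
    "consecutive_days": timedelta(days=1),      # daily bins, consecutive
    "same_weekday_weekly": timedelta(weeks=1),  # daily bins, same day of the week
}


def history_bins(current_bin, method, count):
    """Return the start times of `count` historical bins, newest first."""
    step = STEPS[method]
    return [current_bin - i * step for i in range(1, count + 1)]


# Example: the two most recent Mondays at 10:00 before Monday 2024-01-15 10:00.
mondays = history_bins(datetime(2024, 1, 15, 10), "same_hour_weekly", 2)
```

The volumes for the returned bin start times can then be looked up and fed to the mean and standard deviation calculation.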

The same day of month option for daily time bins does not seem like a good option for most network services and networks. Networks change so rapidly that a history with the newest value already a month old is unlikely to create a mean and standard deviation that reflects the current network state and user and service behavior.

The other history options each have their uses. For network services that exhibit little to no seasonality on the network of interest, a consecutive history works well. A consecutive history also works for services that exhibit marked seasonality, but anomalies would need to be of greater magnitude to be detected than if a corresponding method were used, because the seasonal swings inflate the standard deviation. Looking at the same hour for every day works well when a network service has daily seasonality but little to no weekly seasonality. If the service has both daily and weekly seasonality, looking at the same hour of the day for the same day of the week will be most sensitive to anomalies.
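The sensitivity difference can be illustrated with entirely synthetic data. In the Python sketch below, the traffic pattern, the jitter, and the observed value of 1500 are all made up for the example, and none of it is measured traffic. It scores the same off-hours spike against a 14-day consecutive history and against a same-hour history of 14 days.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def synthetic_volume(ts):
    """Made-up traffic: busy 09:00-17:59, quiet otherwise, with small day-to-day jitter."""
    base = 1000 if 9 <= ts.hour < 18 else 100
    jitter = 5 if ts.day % 2 == 0 else -5
    return base + jitter


def z_score(history, current):
    """Deviations of `current` from the history mean, in sample standard deviations."""
    return (current - mean(history)) / stdev(history)


current_bin = datetime(2024, 1, 15, 3)  # 03:00, normally a quiet hour
observed = 1500                         # hypothetical anomalous volume

# 14 days of consecutive hourly bins vs. the same hour on each of the prior 14 days.
consecutive = [synthetic_volume(current_bin - timedelta(hours=i))
               for i in range(1, 14 * 24 + 1)]
same_hour = [synthetic_volume(current_bin - timedelta(days=i))
             for i in range(1, 15)]

z_consecutive = z_score(consecutive, observed)
z_same_hour = z_score(same_hour, observed)
```

For this synthetic pattern, the consecutive history mixes busy and quiet hours, so its standard deviation is large and the spike scores below 3 deviations. The same-hour history contains only quiet-hour values, so the same spike scores far above 3.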

After exploring different parameters for the history, I decided on hourly bins for a 14-day consecutive history as a quick and simple hunting analytic.
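A rough Python sketch of what such an analytic might look like follows. The input format (a dict mapping each hour's start time to its volume), the threshold of 3 standard deviations, and the choice to treat a missing hour as zero volume are assumptions made for the example, not details of the analytic described here.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev

HISTORY_HOURS = 14 * 24  # 14-day consecutive history of hourly bins


def flag_anomalous_hours(hourly_volumes, threshold=3.0):
    """Return (hour, z-score) pairs for hours that deviate from the prior 14 days."""
    if not hourly_volumes:
        return []
    first = min(hourly_volumes)
    flagged = []
    for hour in sorted(hourly_volumes):
        if hour - timedelta(hours=HISTORY_HOURS) < first:
            continue  # not enough history before this hour yet
        # Assumption: an hour with no recorded bin had zero volume.
        history = [hourly_volumes.get(hour - timedelta(hours=i), 0)
                   for i in range(1, HISTORY_HOURS + 1)]
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history; the z-score is undefined in this sketch
        z = (hourly_volumes[hour] - mean(history)) / sigma
        if abs(z) > threshold:
            flagged.append((hour, z))
    return flagged
```

Because each hour is scored only against the 336 bins before it, an hour that was itself anomalous stays in the history of later hours for two weeks. A hunting analytic might accept that as a simplification or exclude previously flagged bins.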