Machine Learning in Cybersecurity

The year 2016 witnessed advancements in artificial intelligence in self-driving cars, language translation, and big data. That same period, however, also witnessed the rise of ransomware and botnets as popular forms of malware attack, along with new attack vectors, as cybercriminals continually expanded their methods of attack (e.g., attaching scripts to phishing emails and randomization), according to Malwarebytes' State of Malware report. To complement the skills and capacities of human analysts, organizations are turning to machine learning (ML) in hopes of providing a more forceful deterrent. ABI Research forecasts that "machine learning in cybersecurity will boost big data, intelligence, and analytics spending to $96 billion by 2021." At the SEI, machine learning has played a critical role across several technologies and practices that we have developed to reduce the opportunity for and limit the damage of cyber attacks. In this post--the first in a series highlighting the application of machine learning across several research projects--I introduce the concept of machine learning, explain how machine learning is applied in practice, and touch on its application to cybersecurity throughout the article.

Machine learning refers to systems that are able to automatically improve with experience. Traditionally, no matter how many times you use software to perform the same exact task, the software won't get any smarter. Always launch your browser and visit the same exact website? A traditional browser won't "learn" that it should probably just bring you there by itself when first launched. With ML, software can learn from previous observations both to make inferences about future behavior and to guess what you want to do in new scenarios. From thermostats that optimize heating to your daily schedule, to autonomous vehicles that customize your ride to your location, to advertising agencies seeking to keep ads relevant to individual users, ML has found a niche in many aspects of daily life.

To understand how ML works, we first need to understand the fuel that makes ML possible: data. Consider an email spam detection algorithm. Early spam filters would simply blacklist certain addresses and allow other mail through. ML enhanced this considerably by comparing verified spam emails with verified legitimate emails and seeing which "features" were present more frequently in one or the other. For example, intentionally misspelled words ("V!AGR4"), hyperlinks to known malicious websites, and virus-laden attachments are features more likely to indicate spam than legitimate email. (More discussion on "features" below.) This process of automatically inferring a label (i.e., "spam" vs. "legitimate") is called classification, and it is one of the major applications of ML techniques. It is worth mentioning that one other very common technique is forecasting, the use of historical data to predict future behavior. While considerable research and technology have been developed to perform forecasting, the remainder of this post will focus on classification.
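To make the comparison of verified spam with verified legitimate mail concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production filter: the four training emails, the whitespace tokenizer, and the function names `train` and `classify` are all invented for this example. It counts how often each word appears under each label and scores new mail with a simple naive Bayes rule using add-one smoothing.

```python
import math
from collections import Counter

# Hypothetical, hand-labeled training data (illustrative only).
TRAINING = [
    ("spam", "cheap V!AGR4 click here now"),
    ("spam", "win money now click this link"),
    ("legitimate", "meeting agenda for monday attached"),
    ("legitimate", "please review the attached project report"),
]

def tokenize(text):
    # Lowercase and split on whitespace; tokens such as "v!agr4" stay intact.
    return text.lower().split()

def train(examples):
    """Count word occurrences per label and the number of emails per label."""
    word_counts = {}
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_emails = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_emails)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, label_counts = train(TRAINING)
print(classify("click here to win money", word_counts, label_counts))
print(classify("please review the report", word_counts, label_counts))
```

With this toy data, the first message shares several words with the spam examples and the second with the legitimate ones, so they receive different labels. A real filter would train on far more mail and use richer features than individual words.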

There are two major types of ML classification techniques: supervised learning and unsupervised learning, which are differentiated by the data (i.e., input) that they accept. Supervised learning refers to algorithms that are provided with a set of labeled training data, with the task of learning what differentiates the labels. While in our previous example there were only two labels--"spam" and "legitimate"--other scenarios may contain many, many more. For example, modern image recognition algorithms, such as Google Image search, can accurately distinguish tens of thousands of objects, and modern facial recognition algorithms can match or exceed human performance on some tasks. By learning what makes each category unique, the algorithm can then be presented with new, unlabeled data and assign the correct label. Note the importance of choosing a representative training dataset: if the training data contains only dogs and cats, but the new photo is a fish, the algorithm will have no way of knowing the proper label.
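The dogs-and-cats caveat is easy to demonstrate. The sketch below uses a nearest-centroid classifier, one of the simplest supervised techniques, chosen here only for brevity. The measurements, the two features (weight and body length), and the function names are hypothetical.

```python
import math

# Hypothetical labeled training data: (weight_kg, body_length_cm) -> label.
TRAINING = [
    ((30.0, 90.0), "dog"),
    ((25.0, 80.0), "dog"),
    ((4.0, 45.0), "cat"),
    ((5.0, 50.0), "cat"),
]

def fit_centroids(examples):
    """Average the feature vectors of each label to get one centroid per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in acc)
        for label, acc in sums.items()
    }

def predict(centroids, features):
    """Assign the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

centroids = fit_centroids(TRAINING)
print(predict(centroids, (28.0, 88.0)))  # a large animal
print(predict(centroids, (4.2, 46.0)))   # a small animal
print(predict(centroids, (0.5, 20.0)))   # a fish: still forced into "dog" or "cat"
```

The last call is the point of the example: the model can only ever answer with a label it saw during training, so the fish is assigned whichever known category happens to be nearest.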

Unsupervised learning refers to algorithms that are provided with unlabeled training data, with the task of inferring the categories all by themselves. Sometimes labeled data is very rare, or the task of labeling is itself very hard, or we may not even know if labels exist. For example, consider the case of network flow data. While we have enormous amounts of data to examine, attempting to label it would be extremely time-intensive, and it would be very hard for a human to determine what label to assign. Given how good machines are at finding patterns in large datasets, it is often much easier to simply have the machine separate data into groups for us.
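As a rough sketch of what "having the machine separate data into groups" can look like, the following Python implements k-means clustering, one common unsupervised technique (the post does not prescribe a specific one). The flow records, the two features, and the choice of two groups are assumptions made for illustration.

```python
import math

# Hypothetical flow records: (bytes_transferred, duration_seconds). No labels.
FLOWS = [
    (200.0, 1.0), (220.0, 1.2), (180.0, 0.9),
    (50000.0, 30.0), (52000.0, 28.0), (48000.0, 33.0),
]

def mean_point(points):
    """Component-wise average of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def kmeans(points, k, max_iterations=100):
    """Group points into k clusters; returns (assignments, centroids)."""
    # Simple deterministic start; real tools choose starting points more carefully.
    centroids = list(points[:k])
    assignments = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the grouping has converged
        assignments = new_assignments
        # Move each centroid to the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a group ends up empty
                centroids[c] = mean_point(members)
    return assignments, centroids

assignments, centroids = kmeans(FLOWS, k=2)
print(assignments)
```

On this toy data the small, short flows end up in one group and the large, long flows in the other, without anyone telling the algorithm what the groups mean. In practice the features would usually be rescaled first, since here the byte counts dominate the distance calculation.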

Note that separating data into groups assumes that the relevant data is present. Determining the color of someone's skin is fairly trivial for a sighted person, but a blind person will find that task much harder because they lack the most relevant sense. They will have to rely on other information, such as the person's voice, to correctly "label" the individual. Machines are no different in this regard.

We mentioned earlier the concept of a feature. This concept can be understood fairly straightforwardly: if our data is stored in a spreadsheet where a single row represents one data point, then the features are the columns. For our email example, some features may be the sender, recipient, date, and content of the email. For our network flow example, features include packet size, remote IP address, network port, packet content, or any of the hundreds of different attributes that network traffic can have. Having useful features is a critical prerequisite for being able to successfully apply machine learning techniques. At the same time, having too many non-informative features may degrade algorithm performance, as the overabundance of noise can hide more useful information.
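The spreadsheet picture translates directly into code: each email becomes one row, and each feature becomes a named column. The sketch below is only illustrative. The record layout (`body`, `attachments`), the three feature names, and the rule for spotting obfuscated words are assumptions, not a recommended feature set.

```python
import re

def looks_obfuscated(token):
    """True for words mixing letters with digits or symbols, e.g. "V!AGR4"."""
    if token.lower().startswith(("http://", "https://")):
        return False  # URLs are counted separately
    core = token.strip(".,!?")  # ignore ordinary trailing punctuation
    return bool(re.search(r"[A-Za-z]", core)) and bool(re.search(r"[0-9!@$]", core))

def extract_features(email):
    """Turn one raw email record into a row of numeric feature columns."""
    body = email["body"]
    return {
        "num_links": len(re.findall(r"https?://", body)),
        "num_obfuscated_tokens": sum(looks_obfuscated(t) for t in body.split()),
        "has_attachment": int(bool(email["attachments"])),
    }

# Hypothetical email record.
EMAIL = {
    "sender": "offers@example.test",
    "body": "Buy V!AGR4 now! Visit http://example.test/offer today",
    "attachments": [],
}

print(extract_features(EMAIL))
```

Each key in the returned dictionary corresponds to one column. Adding a column is cheap, which is exactly why it is tempting to add too many: columns that carry no signal only give the algorithm more noise to sift through.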

To that end, there is an entire branch within machine learning referred to as feature engineering. The goal of this practice is to extract the maximum information from the available features so as to maximize our ability to predict or categorize unknown data. Frequently these techniques will take multiple features and combine or transform them in complex ways to obtain new, more informative features. While a full treatment of these approaches is outside the scope of this article, interested readers are encouraged to read up on Principal Component Analysis (PCA), a fairly straightforward yet highly useful technique both for creating new features from existing ones and for reducing the total number of features required for the algorithm to function.
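For readers who want a feel for PCA before reading further, here is a deliberately small sketch restricted to exactly two features, where the required eigenvalue calculation has a closed form. General-purpose PCA works on any number of features and is normally done with a linear algebra library. The data values and the function name `pca_2d` are hypothetical.

```python
import math

def pca_2d(points):
    """PCA for two features: returns (direction, projected values, explained share)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(dx * dx for dx, _ in centered) / (n - 1)
    c = sum(dy * dy for _, dy in centered) / (n - 1)
    b = sum(dx * dy for dx, dy in centered) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

    # Matching eigenvector: the first principal component.
    if b != 0:
        direction = (b, largest - a)
    else:
        direction = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(*direction)
    direction = (direction[0] / norm, direction[1] / norm)

    # One new feature per point: its coordinate along the principal component.
    projected = [dx * direction[0] + dy * direction[1] for dx, dy in centered]
    explained = largest / (a + c) if (a + c) > 0 else 0.0
    return direction, projected, explained

# Hypothetical data in which the second feature is always twice the first.
POINTS = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, projected, explained = pca_2d(POINTS)
print(direction, explained)
```

Because the two columns in this toy dataset move in lockstep, a single new feature captures essentially all of the variation, and the second column adds nothing. That is the sense in which PCA both creates new features and reduces how many are needed.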

One last topic to address is that of big data. From the above cases we can understand that more data is almost always a good thing; it allows algorithms to be aware of many more varieties of categories. Continuing our email example, while one person may get a lot of spam, many people get a tremendous amount of spam, providing that many more examples for the ML algorithm to train against. Within the past ten years, as the value of data has been realized, enormous databases for all types of imaginable data have sprung up containing sometimes billions of rows of data, with hundreds of thousands of features. Such enormous datasets are technically hard to work with, and an entire field of research and tooling has developed with the specific intent of simplifying the process of working with data of this size. This is the field of big data.

The steps required to create an ML tool are varied, but typically proceed as follows:

Data collection. While it's possible to run and even create ML algorithms based on streaming, real-time data (e.g., trading decisions based on stock market data), the majority of techniques involve collecting data ahead of time and creating a model using stored data.
Data cleaning. Raw data is often unusable for ML purposes. There may be missing data, inconsistent data use (e.g., a cardinal direction feature may contain "North", "north", and "N", all identical in meaning), and numeric data with non-numeric characters, among many other possible problems. This step also involves combining multiple data sources into a single usable source. Cleaning is often a time-consuming and iterative process, as fixing one issue often uncovers another.
Feature engineering. After all the data is ready for use, it's time to ensure that maximum information is extracted from the data itself, as described above. This process usually takes place prior to creating the ML algorithm.
Model building/model validation. This set of steps involves building the model and testing to ensure it works properly on new, unlabeled data. There are many statistical issues to consider when testing the model. When working with supervised ML, a chief concern is whether the model is overfit to the training data, i.e., whether the model that was produced relies on properties that are unique to the training data and do not generalize. There are many statistical techniques used to minimize this risk, which are often employed during model validation.
Deployment/Monitoring. Deployment of an ML model is rarely a "once-and-done" event. Generally, and especially in the case of network traffic, historical observations do not necessarily match future activity. For that reason, even after deployment, models are monitored and periodically rerun through the build/validate step to ensure top performance.
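Two of the steps above, data cleaning and model validation, lend themselves to a short sketch. The alias table, the helper names, and the 25 percent holdout fraction below are assumptions chosen for illustration. Real cleaning rules depend entirely on the dataset, and a simple holdout split is only one of several validation techniques.

```python
import random

# Hypothetical mapping of inconsistent spellings to one canonical code.
DIRECTION_ALIASES = {
    "north": "N", "n": "N",
    "south": "S", "s": "S",
    "east": "E", "e": "E",
    "west": "W", "w": "W",
}

def clean_direction(raw):
    """Return a canonical direction code, or None if missing or unrecognized."""
    if raw is None:
        return None
    return DIRECTION_ALIASES.get(raw.strip().lower())

def clean_number(raw):
    """Parse numeric text that may carry stray characters such as '$' or ','."""
    if raw is None:
        return None
    kept = "".join(ch for ch in raw if ch.isdigit() or ch in ".-")
    try:
        return float(kept)
    except ValueError:
        return None  # nothing numeric survived; treat the value as missing

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and hold out a fraction the model never sees in training."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

print(clean_direction(" North "), clean_direction("n"), clean_direction("N"))
print(clean_number("1,024 bytes"), clean_number("n/a"))

train_rows, test_rows = train_test_split(range(8))
print(len(train_rows), len(test_rows))
```

The held-out rows are what make an overfitting check possible: if a model scores well on the training rows but noticeably worse on the rows it never saw, it has likely learned quirks of the training data rather than patterns that generalize.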

Looking Ahead

In future posts in this series, I will be exploring the application of machine learning to various research areas to reduce the opportunity for, and limit the damage of, various types of cyber attacks. Some specific areas include:

insider threat
malware analysis
network analytics
secure coding
situational awareness

In addition to the above, we will also describe how we are applying ML to training new cybersecurity analysts.

I welcome your feedback on this work in the comments section below.

Software Engineering Institute

SEI Blog

Eliezer Kanal

June 5, 2017
