Benford's Law: Potential Applications for Insider Threat Detection
Detecting anomalous network activity is a powerful way to discover insider threat activities. It is time consuming, however, to establish baseline traffic and process traffic data. This blog post explores how a mathematical law, already used in forensic accounting, may help detect insider activity without the effort of traditional anomaly detection.
Benford's law of anomalous numbers states that generally, in naturally occurring collections of numbers, the leading digit is likely to be small. This means that the numeral 1 will be the leading digit in a genuine dataset 30.1 percent of the time; the numeral 2 will be the leading digit 17.6 percent of the time; and each subsequent numeral, 3 through 9, will be the leading digit with decreasing frequency. The resulting downward-sloping curve can be used as a baseline for determining whether a dataset is genuine or fabricated.
Accountants often compare the leading digits of financial transaction data, such as ledger entries, to a Benford curve to spot anomalies that may indicate fraud. The same technique can be used to detect irregular network activity and other data that may indicate malicious insider activity.
Benford's law is grounded in base-10 logarithms. A number x begins with digit d when the fractional part of log10(x) falls between log10(d) and log10(d+1), an interval of length log10(d+1) - log10(d) = log10(1+1/d). That length is the probability that d is the leading digit. Plugging in the digits 1 through 9 shows that each subsequent digit is less likely to be the leading digit.
Figure 1: Logarithmic Intervals of Leading Digits, Based on log10(x)
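As a quick check of the formula above, the following minimal Python sketch evaluates log10(1+1/d) for each digit. The names benford_probability and BENFORD are illustrative choices, not part of any established library.

```python
import math

def benford_probability(d: int) -> float:
    """Expected share of values whose leading digit is d (1-9) under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Expected probability for every possible leading digit.
BENFORD = {d: benford_probability(d) for d in range(1, 10)}

for d, p in BENFORD.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities sum to 1 because the intervals tile the whole range from log10(1) = 0 to log10(10) = 1.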
The size of the number doesn't matter. Whether you're dealing with five-digit or two-digit numbers, the leading digit depends only on the fractional part of the number's base-10 log. For data fitting the Benford assumptions, the probability of a given leading digit can therefore be read from the first two decimals of that log.
Consider 1,002: log10(1,002) ≈ 3.000867. The first two decimals are within the .00-.30 interval, which holds the base-10 log values of numbers with a leading digit of 1. The width of this interval reflects the fact that about 30 percent of naturally occurring numbers that fit the Benford assumptions have a leading digit of 1. Similarly, consider 52: log10(52) ≈ 1.716003. The first two decimals are within the .70-.78 interval, which holds the base-10 log values of numbers with a leading digit of 5. The width of this interval reflects that about 8 percent of numbers fitting the Benford assumptions have a leading digit of 5. The table below shows a more complete example of how the probability of leading digits is found with base-10 logarithm calculations.
Table 1: Example of Base-10 Logs for Leading Digits 1-9
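The interval lookup described above can be sketched in a few lines of Python. This is an illustration under simple assumptions: leading_digit_from_log and INTERVAL_STARTS are hypothetical names, and values that sit exactly on an interval edge (such as 20 or 300) may be affected by floating-point rounding.

```python
import math

# Lower edge of each digit's interval on the fractional part of log10(x):
# digit 1 starts at 0.000, digit 2 at 0.301, digit 3 at 0.477, and so on.
INTERVAL_STARTS = [math.log10(d) for d in range(1, 10)]

def leading_digit_from_log(x: float):
    """Return (leading digit, fractional part of log10(x)) for a positive x."""
    if x <= 0:
        raise ValueError("x must be positive")
    mantissa = math.log10(x) % 1.0  # fractional part, always in [0, 1)
    digit = 1
    for d, start in enumerate(INTERVAL_STARTS, start=1):
        if mantissa >= start:
            digit = d  # keep the last interval whose lower edge we have passed
    return digit, mantissa

for x in (1002, 52, 0.0734):
    digit, mantissa = leading_digit_from_log(x)
    print(f"x={x}: fractional log = {mantissa:.6f}, leading digit = {digit}")
```

For 1,002 and 52 the function returns leading digits 1 and 5, matching the worked examples in the text.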
The conclusion from all this math: numbers in a dataset that fits all the Benford assumptions should follow this distribution of leading digits, with 1 being the most common and 9 being the least.
Figure 2: Probability Distribution of Leading Digits Under Benford's Law
For a conclusion on a Benford curve to be valid, the data must (1) be numeric, (2) be randomly generated, (3) be large, and (4) represent magnitudes of events. Many types of data fit these assumptions, including population counts, accounting data, and network traffic. Data comprising numbers used as identifiers, such as phone numbers and social security numbers, violates the assumption that the data is generated randomly.
Figure 3: Leading Digit Distribution of Population Data
Application in Accounting
Benford's law is widely used in accounting to examine data for anomalies that may indicate fraud. Accountancy data generally follows the four assumptions required for a valid conclusion on a Benford curve: general ledgers, income statements, and inventory listings can all be compared to the curve to determine genuineness.
This analysis may be admissible evidence of fraud in federal and state courts. The forensic accounting community generally accepts the methodology, which is referenced in the Fraud Examiners Manual. Forensic accountants, fraud examiners, accountants, and auditors use Benford's law to detect anomalies that require investigation. The combination of the method's widely accepted usage, its academic reputation, and the wide availability of experts makes the admissibility of Benford analyses likely.
Shifting the Framework to Technical Insider Threat
Network traffic typically follows the four assumptions required for a conclusion on the Benford curve to be valid. Benford analysis's long-standing use in accounting and its suitability for information security's naturally generated data make the process viable for technical insider threat detection. Benford analysis is especially useful in detecting both highly likely and unlikely data points, so it serves as a dual measure of both normalcy and aberration.
Current cybersecurity systems rely heavily on identifying anomalous behaviors. Looking only for known signatures does not address the breadth of the threat landscape--unknown signatures are equally important. Anomaly detection is generally hard to establish because creating a baseline traffic profile and processing the large amount of traffic data are time-consuming processes.
Benford's law can help avoid the effort of baseline-derived anomaly detection. If the network traffic conforms to the assumptions of Benford's law, any traffic data deviating from the Benford curve can be considered an anomaly. The law itself supplies the expected distribution, so much of the legwork of building a baseline by hand is unnecessary.
A small-scale example application of this technique can be demonstrated with spreadsheet macros.
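The same comparison can also be sketched outside a spreadsheet. The Python below is a minimal illustration, not the macro referred to above: it counts leading digits and runs a chi-square goodness-of-fit test against the Benford expectation. The function names and both sample datasets are assumptions made for the example. The critical value 15.507 is the standard chi-square cutoff for 8 degrees of freedom at the 5 percent level.

```python
import math
from collections import Counter

def leading_digit(value):
    """Return the first significant digit of a nonzero number, or None for zero."""
    value = abs(value)
    if value == 0:
        return None
    # Scientific notation puts the first significant digit first, e.g. '5.2...e+01'.
    return int(f"{value:.15e}"[0])

def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's law."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one nonzero value")
    observed = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (observed[d] - expected) ** 2 / expected
    return stat

CHI2_CRITICAL_8DF_05 = 15.507  # 9 digits - 1 = 8 degrees of freedom, alpha = 0.05

# Illustrative data: powers of 2 track Benford closely; 100..999 is uniform.
conforming = [2 ** k for k in range(1, 501)]
uniform = list(range(100, 1000))

for name, data in (("powers of 2", conforming), ("uniform 100-999", uniform)):
    stat = benford_chi_square(data)
    verdict = "deviates from" if stat > CHI2_CRITICAL_8DF_05 else "is consistent with"
    print(f"{name}: chi-square = {stat:.2f}, {verdict} Benford")
```

A statistic above the critical value only says the digits deviate from the Benford curve. Whether that deviation reflects malicious activity still requires investigation.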
Insider Threat Applications
To demonstrate the potential applications of Benford's law to insider threat detection, let's explore some scenarios inspired by those we capture in the CERT Insider Threat Incident Corpus.
An organization has alert thresholds set at $500 and $1,000. One of the bookkeepers has been writing bad checks for $499 and $999, just under the alert thresholds. After the organization performs an audit and compares the data from the books to the Benford curve, the bookkeeper is caught.
In this situation, the digits 4 and 9 occur as the leading digits more frequently than they should in a natural dataset. This data does not conform to Benford's law and is cause for concern.
Figure 4: Hypothetical Ledger Data Affected by Check Fraud
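To illustrate how an audit might pinpoint the affected digits, the sketch below computes a z-score for each leading digit and lists the digits that occur significantly more often than Benford's law predicts. Everything here is hypothetical: the ledger is synthetic (log-uniform amounts standing in for genuine entries, plus 40 checks each at $499 and $999), and the 1.96 cutoff is a conventional choice rather than a recommendation.

```python
import math
from collections import Counter

def leading_digit(value):
    """First significant digit of a positive number."""
    return int(f"{abs(value):.15e}"[0])

def overrepresented_digits(values, z_cutoff=1.96):
    """Digits whose observed share exceeds the Benford share by more than z_cutoff."""
    n = len(values)
    counts = Counter(leading_digit(v) for v in values)
    flagged = []
    for d in range(1, 10):
        p = math.log10(1 + 1 / d)            # expected share under Benford
        std_err = math.sqrt(p * (1 - p) / n)  # binomial standard error
        z = (counts[d] / n - p) / std_err
        if z > z_cutoff:
            flagged.append(d)
    return flagged

# Synthetic "genuine" amounts spread evenly on a log scale from $10 to about $97,700.
genuine = [10 ** (1 + k / 100) for k in range(400)]
# Hypothetical fraudulent checks written just under the $500 and $1,000 thresholds.
ledger = genuine + [499.0] * 40 + [999.0] * 40

print("genuine only:", overrepresented_digits(genuine))
print("with bad checks:", overrepresented_digits(ledger))
```

On this synthetic ledger the flagged digits are 4 and 9, while the genuine amounts alone flag nothing.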
An employee creates fictitious invoice charge data to hide their illicit activity by randomly typing numbers on the horizontal number keys. Another employee notices irregularities in the Benford analysis of the invoice data, and the employee who created the fictitious data is caught.
In this situation, the digits 4, 5, 6, and 7 occur as the leading digits more frequently because of the employee's hand placement on the number keys. Even fabricated data that seems random can be separated from genuine data.
Figure 5: Data Generated by Typing on the Horizontal Number Keys
A disgruntled co-founder of a tech company argues with his partner and decides to leave the company, but not before downloading large trade-secret files. The co-founder has authorized access to the trade secrets and regularly views and works with the files. He deals with numerous uploads and downloads on a daily basis, so he doesn't think he'll get caught.
Measures of network traffic generally follow a Benford curve. Though the co-founder typically deals with the trade secrets and has high network usage, the unexpected increase over his normal network activity shifts the distribution of leading digits in the company's network traffic, signaling an abnormality. An analytic that detects changes in the statistical distribution of network activity triggers an alert of suspicious activity. In this case, the co-founder does not get away with it.
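One way such an analytic could work is sketched below. It splits a stream of transfer sizes into fixed windows, measures each window's mean absolute deviation (MAD) from the Benford distribution, and flags windows above a threshold. The window size, the 0.015 threshold, the function names, and the synthetic traffic are all assumptions for illustration. A real deployment would need tuning against the organization's own data.

```python
import math
from collections import Counter

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def leading_digit(value):
    """First significant digit of a positive number."""
    return int(f"{abs(value):.15e}"[0])

def benford_mad(values):
    """Mean absolute deviation between observed and Benford leading-digit shares."""
    n = len(values)
    counts = Counter(leading_digit(v) for v in values)
    return sum(abs(counts[d] / n - BENFORD[d - 1]) for d in range(1, 10)) / 9

def flag_windows(values, window_size, threshold):
    """Indices of non-overlapping windows whose MAD exceeds the threshold."""
    flagged = []
    for i, start in enumerate(range(0, len(values) - window_size + 1, window_size)):
        if benford_mad(values[start:start + window_size]) > threshold:
            flagged.append(i)
    return flagged

# Synthetic transfer sizes in bytes, spread evenly on a log scale (Benford-like).
normal_window = [10 ** (k / 100) for k in range(400)]
# Hypothetical exfiltration: 100 transfers of roughly 5 MB each crowd the window.
exfil_window = normal_window[:300] + [5_000_000 + 1000 * i for i in range(100)]

traffic = normal_window + exfil_window
print("flagged windows:", flag_windows(traffic, window_size=400, threshold=0.015))
```

On this synthetic stream only the second window (index 1) is flagged, because the burst of similarly sized transfers inflates the share of leading 5s.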
An employee finds out he is going to be laid off and decides to launch a denial-of-service (DoS) attack on the company's network. The company's IT department has recently established baseline interval times and packet lengths. They are quickly able to identify the anomaly caused by the employee and stop the attack.
Benford's law is especially useful in detecting DoS attacks because flooding a network with data breaks the naturalness of network traffic.
It is important to use the resources that we already have access to. Many accounting departments have longstanding experience with Benford analyses, so applying the Benford framework to an information security context should be simpler than creating new techniques for monitoring threshold activity. This control does not rely on labeled historical data. Instead, it leverages the data's natural conformity to the assumptions of Benford's law and tests that conformity against the Benford expectation.
Not all organizational data fits the Benford assumptions. For example, organizations that consistently facilitate transactions with high leading digits may find that the Benford method is of limited use. In the future, we could compare the return on investment and efficacy of Benford analysis for anomaly detection with those of more conventional statistical methods used for insider threat, such as Bayes' theorem.