Why NetFlow Data Still Matters
Network flow plays a vital role in the future of network security and analysis. With more devices connecting to the Internet, networks are larger and faster than ever before. Therefore, capturing and analyzing packet capture data (pcap) on a large network is often prohibitively expensive. Cisco developed NetFlow 20 years ago to reduce the amount of information collected from a communication by aggregating packets with the same IP addresses, transport ports, and protocol (also known as the 5-tuple) into a compact record. This blog post explains why NetFlow is still important in an age in which the common wisdom is that more data is always better. Moreover, NetFlow will become even more important in the next few years as communications become more opaque with the development of new protocols that encrypt payloads by default.
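The core idea of 5-tuple aggregation can be sketched in a few lines. This is a simplified illustration, not NetFlow's actual export format; the packet fields and sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical packet records: (src_ip, dst_ip, src_port, dst_port, proto, size_bytes)
packets = [
    ("10.0.0.1", "93.184.216.34", 51514, 443, "tcp", 1500),
    ("10.0.0.1", "93.184.216.34", 51514, 443, "tcp", 40),
    ("10.0.0.2", "8.8.8.8", 53211, 53, "udp", 64),
]

def aggregate_flows(packets):
    """Collapse individual packets into one compact record per 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, sport, dport, proto, size in packets:
        key = (src, dst, sport, dport, proto)  # the 5-tuple
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)
```

Two 1500- and 40-byte packets of the same TCP connection collapse into a single record with `packets=2, bytes=1540`, which is why flow storage scales so much better than pcap.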
The History of NetFlow
Capturing pcap data was a hard problem even when networks were smaller and slower than they are today. In the early days of NetFlow, the goal was to collect only the most important information from the transaction to keep the flow record as compact and lightweight as possible. NetFlow has evolved from that initial goal, and now NetFlow sensors and collection systems are able to do much more than simply aggregate packets and bytes. Network flow remains relevant in network security because it is still the most efficient way to collect and store information about the endpoints, communications, applications, and users that make up the cyber environment.
Network administrators and data analysts are in a constant struggle between wanting to store everything and having too much data to manage. Fortunately, collecting every flow record, even in large networks, is still an achievable goal. Flow data is tried and true for accounting, network forensics, and creating baseline network profiles useful for identifying malicious or anomalous activity. It can also help business administrators make decisions about prioritization of resources or how to plan for necessary changes to the network. Flow data may help an organization identify
- misconfigurations of network devices
- data exfiltration
- network scans from an external source
- denial-of-service attacks
- machines that are beaconing to a command-and-control (C2) server
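The last item on that list lends itself to a simple flow-based heuristic: C2 beacons tend to call home at nearly constant intervals, while normal browsing does not. The following sketch, with made-up timestamps and a hypothetical jitter threshold, flags destinations whose inter-flow gaps are suspiciously regular.

```python
import statistics

# Hypothetical flow start times (seconds), grouped by destination IP.
flows_by_dest = {
    "203.0.113.9": [0, 300, 600, 900, 1200],   # near-constant 300 s gaps
    "198.51.100.7": [12, 95, 400, 410, 1100],  # irregular, browsing-like
}

def beacon_candidates(flows_by_dest, max_jitter=0.1):
    """Flag destinations whose inter-flow intervals are nearly constant."""
    suspects = []
    for dest, times in flows_by_dest.items():
        gaps = [b - a for a, b in zip(times, times[1:])]
        if len(gaps) < 2:
            continue
        mean = statistics.mean(gaps)
        # Coefficient of variation near zero => metronome-like beaconing.
        if mean > 0 and statistics.pstdev(gaps) / mean <= max_jitter:
            suspects.append(dest)
    return suspects
```

Flow records alone carry enough timing information for this kind of check; no payload inspection is required.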
The concept of network flow data started to evolve with the introduction of new NetFlow formats, such as Cisco's NetFlow v9 and IPFIX. Researchers started to explore how to expand NetFlow to report per-packet information to get a more detailed understanding of the network traffic. PSAMP (Packet SAMPling) and SIP session data export were topics that gained attention within the Internet Engineering Task Force (IETF). Likewise, metadata collection and deep packet inspection (DPI) were developed within flow collectors. Analysts who examined this data realized the value of certain fields within the HTTP and DNS protocols for identifying malicious activity and command-and-control infrastructure. Analysts then became interested in correlating DNS and HTTP logs with flow data to understand the extent of a compromise to the network.
These days, we have to deploy several different sensors (e.g., a PCAP sensor, a NetFlow sensor, a Bro sensor, and an IDS/IPS sensor) on our network to create a well-rounded security sensing stack. Each of these sensors does something unique and important to protect our network:
- PCAP sensors collect an exact record of network communications.
- NetFlow sensors quickly summarize and store network transactions.
- Bro sensors provide advanced protocol analysis.
- IDS/IPS sensors generate alerts about well-known attacks or significant anomalies from expected network traffic.
The next issue to address is whether all this data can be stored and correlated in a way that is useful for identifying concerns and problems. Wouldn't it be nice if various data types could be analyzed to yield new metrics to make better sense of a network? For example, by combining advanced protocol data with flow, it would be possible to find a relationship, or lack thereof, between malicious HTTP traffic and packet inter-arrival times.
Correlating different data types was once a burdensome problem due to slow network speed, lack of standards and interoperability, and database restrictions. The effort required was a huge undertaking since a lot of custom development was needed to get the data in one place. Fortunately, networks, systems, and tools have advanced in such a way that many of these issues are much more tractable. Moreover, all these data sources actually contain some flow information. In most cases, NetFlow can be the primary key to connect the various network data sources.
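Using the 5-tuple as a join key can be sketched as follows. The record layouts and field names here are illustrative only; real tools emit richer schemas, but the join logic is the same.

```python
# Hypothetical records sharing 5-tuple fields; names are illustrative.
flows = [
    {"sip": "10.0.0.5", "dip": "198.51.100.20", "sport": 49152,
     "dport": 80, "proto": 6, "bytes": 18234},
]
http_logs = [
    {"sip": "10.0.0.5", "dip": "198.51.100.20", "sport": 49152,
     "dport": 80, "proto": 6, "host": "example.com", "uri": "/login"},
]

def five_tuple(rec):
    """Extract the 5-tuple join key from a record."""
    return (rec["sip"], rec["dip"], rec["sport"], rec["dport"], rec["proto"])

def enrich(flows, http_logs):
    """Attach HTTP metadata to flow records via the shared 5-tuple key."""
    index = {five_tuple(h): h for h in http_logs}
    enriched = []
    for f in flows:
        merged = dict(f)
        h = index.get(five_tuple(f))
        if h:
            merged["host"], merged["uri"] = h["host"], h["uri"]
        enriched.append(merged)
    return enriched
```

In practice the key usually includes a time window as well, since ports are reused, but the principle is the same: flow fields are the common denominator across sensor outputs.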
Flow provides a great foundation for understanding the environment. The amount of data that is generated from the various sensors can be overwhelming and it is often hard to even know where to begin. Once a baseline profile for the network has been determined through utilizing the metrics that flow provides, defenders can monitor for changes and investigate further via flow and additional data sources.
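Monitoring for changes against a baseline can be as simple as a standard-deviation test over a flow-derived metric. The hourly byte counts and the 3-sigma threshold below are hypothetical, chosen only to illustrate the idea.

```python
import statistics

# Hypothetical baseline: hourly outbound byte counts from flow records.
baseline = [1.1e6, 0.9e6, 1.0e6, 1.2e6, 1.0e6]

def deviates(observed, baseline, n_sigma=3):
    """Flag an observation more than n_sigma standard deviations
    from the baseline mean."""
    mean = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    return abs(observed - mean) > n_sigma * sigma
```

A sudden 5 MB hour against a ~1 MB baseline trips the check and tells a defender where to pivot into the richer data sources.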
The Evolution of Flow 2.0
The next generation of NetFlow innovators is migrating network analysis to big data platforms. While this transition may seem obvious, a lot of work must be done to develop a system that is record-structure agnostic so that flow and non-flow data can be integrated in the same analysis. Fortunately, advanced analytic capabilities, such as machine learning and graph algorithms, can help identify patterns that were not detectable before. When flow and non-flow data is combined, systems can perform data enrichment so that network defenders can quickly and easily make decisions about potential concerns on the network.
There is work to be done on the sensing side as well. We could be smarter about what we collect and store. To make the most out of every byte that is stored and presented to analysts, important decisions about the data must be made as early as possible. Capturing a million copies of Facebook's X.509 certificate is not analytically useful. Removing duplicate protocol data, such as SSL certificates, DNS requests and responses, and HTTP headers, can significantly reduce the data presented to an analyst, as well as identify anomalies when information unexpectedly changes. Similarly, encrypted payloads are of little value to retain or analyze.
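A first-seen deduplication pass captures both benefits described above: it drops repeats, and it surfaces the rare case where a stored value unexpectedly changes. The key/value records below are hypothetical stand-ins for certificate or DNS observations.

```python
def dedup_stream(records):
    """Keep the first occurrence of each (key, value) pair;
    flag keys whose value later changes."""
    seen = {}
    kept, changes = [], []
    for key, value in records:
        if key not in seen:
            seen[key] = value
            kept.append((key, value))
        elif seen[key] != value:
            changes.append((key, seen[key], value))  # anomaly: value changed
            seen[key] = value
    return kept, changes

# Hypothetical certificate observations keyed by domain.
records = [
    ("facebook.com:cert", "sha1:aaaa"),
    ("facebook.com:cert", "sha1:aaaa"),  # duplicate, dropped
    ("facebook.com:cert", "sha1:bbbb"),  # unexpected change, flagged
]
```

A million identical certificate sightings collapse to one stored record, while the one change worth an analyst's attention is pulled out explicitly.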
If the analytic platforms can notify sensors in real time about how and when to collect payload of certain flows, the burden on storage systems can be decreased and the retention time and potential value of the data collected can be increased. Likewise, tagging data early, whether by application labeling or identifying suspicious IPs and domains, can reduce the time between recognizing a problem and addressing it.
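Early tagging can be sketched as a lightweight check applied to each flow as it is created. The watchlist contents and port choices here are hypothetical; in a real deployment they would come from threat intelligence or the analytic platform's feedback loop.

```python
# Hypothetical watchlist, fed in practice by threat intelligence.
SUSPICIOUS_IPS = {"203.0.113.9"}
SUSPECT_PORTS = {6667, 4444}  # ports sometimes abused by C2 tooling

def tag_flow(flow):
    """Attach tags at collection time so downstream analysis
    can filter and correlate without re-scanning raw data."""
    tags = []
    if flow["dip"] in SUSPICIOUS_IPS:
        tags.append("watchlist-dest")
    if flow["dport"] in SUSPECT_PORTS:
        tags.append("suspect-port")
    flow["tags"] = tags
    return flow
```

Because the tags travel with the record, an analyst querying the data store later can pivot on them immediately instead of rebuilding the context from scratch.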
We need to take advantage of the analysis that can be automated and spend time incorporating those analyses onto the sensors so it is available at the earliest point in the sensing chain. For instance, we might try to identify network patterns in known malware that may be hard to detect with intrusion detection systems, then develop methods in flow sensors to tag the traffic early so it can be combined with other data sources and analyzed in more detail.
CERT Tools for Cyber Situational Awareness
For over a decade, CERT has been developing tools and techniques for establishing cyber situational awareness. The following tools provide methods for collecting and analyzing network flow data to support awareness:
- YAF (Yet Another Flowmeter) captures network information and creates IPFIX-based flow records. This tool provides additional features beyond creating traditional flow records, such as indexing and creating PCAP data, performing deep packet inspection and application labeling, and reading IPFIX files.
- SiLK, the System for Internet-Level Knowledge, provides the capability to collect, store, and view the raw flow data generated by YAF or most other flow sensors.
- super_mediator is an IPFIX mediator for use with the YAF and SiLK tools. It collects and filters IPFIX data and can convert the data to CSV or JSON or export the data to other IPFIX collectors (e.g., Pipeline and SiLK). It also performs deduplication of deep packet inspection data exported by YAF.
- Analysis Pipeline provides a streaming analytic toolkit to support inspection of flow records as they are created. It is now capable of inspecting IPFIX fields beyond traditional flow data (e.g., DNS, HTTP, and SSL DPI data created by YAF).
The Future of Flow
Flow is likely to become more important in the near future with the development of protocols like QUIC, TCP-ENO, and MPTCP. With the evolution of cloud and mobile technologies, the transport layer is starting to evolve at a rapid pace that will impact network analysis, including flow analysis. Moreover, the definition of flow is expanding with protocols, such as MPTCP, that divide a traditional TCP flow into several subflows, making it hard for current IDS/IPS and protocol analyzers to reassemble the payload and identify anomalies.
Many new protocols enable encryption by default, and most connections will become opaque to IDS/IPS, protocol inspection, and similar technologies. We will need to fall back on non-signature-based protection and detection systems, such as NetFlow. It will also be important to develop non-signature-based analytics that complement flow, such as passive DNS and SSL certificate analysis. The intra-communication information that some sensors provide will benefit machine learning and automated analysis approaches. Analyzing traffic behavior patterns will become even more important in gaining cyber situational awareness.
Network flow analysis is critical to gaining cyber situational awareness. We will be discussing the future of flow and more innovative network flow analysis at FloCon 2017 in San Diego, CA.
Learn more about FloCon.