Working with the Internet Census 2012

Timur Snoke Deana Shick

Hi, it's Timur Snoke of the CERT NetSA group, posting on behalf of Deana Shick and Angela Horneman. It's not every day that 9.6 terabytes of data is released into the public domain for further research. The Internet Census 2012 project scanned the entire IPv4 address space using the Nmap Scripting Engine (NSE) between March and December of 2012. The engineer of this data set (identity unknown) saved and released the collected data in early 2013. The data is broken down into seven types of scan results: ICMP ping, reverse DNS, service probes, host probes, syncscan queries, TCP/IP fingerprints, and traceroute.

This information has proved valuable in our research in understanding aspects of devices associated with various sets of IP addresses, as shown in our upcoming tech report "Investigating Advanced Persistent Threat 1." This vast source of information also has potential for many other research projects.

How the Internet Census Data Was Collected

The NSE allows users to write unique scripts to augment Nmap for their individual needs. Instead of scanning a personal network, the engineer used a personal NSE script to port scan the /0 address space. The engineer conducted the scan by developing a botnet, the Carna botnet, that deployed a small binary onto a group of non-secure sample machines. These non-secure machines were used to build a port scanner for the entire IPv4 address space.

The botnet consisted of a central server for data collection and analysis, middle nodes to transmit large pieces of data to the central server, and many devices for data collection. The data collection machines scanned other machines via a slew of Python scripts from this binary on Telnet port 23, which transferred data back to middle nodes. The middle nodes forwarded the information to the main server for data collection and analysis.

Because the Internet Census 2012 data was collected with a botnet and the scans were done without permission, ethical issues surround the use of this data set in research. With the internet census, and any other data set obtained in what may be viewed as an illegitimate manner, researchers must decide whether they have the right to use the information. In some cases, the answer is clear cut, for example when the data set contains legally restricted information. Other cases are a judgment call. Researchers responsible for determining the impact of a botnet or remediating issues arising from a botnet must look at information related to that botnet, including what it collected, to map out appropriate responses and learn how to prevent future botnet infections. Other researchers may benefit by using the information collected from botnets in other types of studies. However, if they choose to use the information, they have the responsibility to use the data in a manner that protects the people affected by the collection from further harm. Some ways to do this are to use only aggregated, anonymized data and to be selective in publication.

As far as we can tell, this data set does not contain legally restricted information, like classified information, personally identifiable information, or trade secrets.

Types of Saved Scans

Internet Control Message Protocol (ICMP) Ping

When an ICMP ping is sent to a device, it measures the time the device takes to return a response. The response will return one of the following: network unreachable, host unreachable, communication administratively prohibited, alive, alive from A.B.C.D, or unreachable from A.B.C.D. For the purposes of the IPv4 scan, the ICMP ping tells if the address was reachable, and if not, what IP address was used for the scan. This scan sent an ICMP ping to every address in the IPv4 space every few days.

The ICMP ping data provided in the internet census is divided by CIDR /8 netblock; each /8 netblock contains its own unique file. Besides the target IP address and status, the ICMP ping records also include a Unix format timestamp. Below are examples of IP address listings in an internet census ICMP file.

IP Address Timestamp Status
10.0.0.1 1344755700 Unreachable
10.0.0.2 1355490900 ICMP Type: 11, ICMP Time Exceeded

Individual IP addresses may or may not have multiple entries within the ICMP data. Because the scans were repeated throughout the March through December 2012 time frame, the uneven number of entries per address indicates that only a portion of the responses collected for the census were released. As a result, we cannot tell whether a particular IP address never had an "alive" response. However, it may be useful to know that an IP address has at least one "alive" entry in the data.
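
As a rough illustration of the "at least one alive entry" check, the following Python sketch reads records in the layout shown above. It assumes whitespace- or tab-separated fields and that any status beginning with "alive" counts as a live response. The function names and the third sample record are hypothetical and are not taken from the data set.

```python
from datetime import datetime, timezone


def parse_icmp_line(line):
    """Split one record into (ip, UTC datetime, status).

    The status may contain spaces, so only the first two
    whitespace runs are treated as field separators.
    """
    ip, ts, status = line.strip().split(None, 2)
    return ip, datetime.fromtimestamp(int(ts), tz=timezone.utc), status


def ever_alive(lines):
    """Return the set of IP addresses with at least one 'alive' status."""
    alive = set()
    for line in lines:
        if not line.strip():
            continue
        ip, _, status = parse_icmp_line(line)
        # Assumption: "alive" and "alive from A.B.C.D" both count as alive.
        if status.lower().startswith("alive"):
            alive.add(ip)
    return alive


sample = [
    "10.0.0.1\t1344755700\tUnreachable",
    "10.0.0.2\t1355490900\tICMP Type: 11, ICMP Time Exceeded",
    "10.0.0.3\t1344755700\talive",  # hypothetical record for illustration
]

print(ever_alive(sample))
```

Because only part of the collected responses appear to have been released, an address missing from the result is not proof that it was never alive.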

Reverse DNS (rDNS) Scan

This scan queried every IP address in the IPv4 address space for related domain names by sending requests to the top 16 DNS servers. Once compiled, DNS records were sent back to the collection server. The reverse DNS data provided in the internet census is divided by CIDR /8 netblock; each /8 netblock contains its own unique file. The rDNS entries also contain a Unix format timestamp for when the rDNS response was received. Below are examples of IP address listings in an internet census rDNS file.

IP Address Timestamp Response
10.0.0.1 1338933900 (3)
8.8.8.8 1346766300 google-public-dns-a.google.com

The rDNS data has a large number of entries that have a numeric response, indicating that some error, such as record not found, occurred in the request. For the entries that have domain names in the responses, these may be helpful to use as a snapshot of domain names at various time periods in the second half of 2012. However, keep in mind that this data comes from only the 16 largest DNS servers and that the domain name to IP address mapping can change often.
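
A minimal sketch of separating error entries from usable domain names might look like the following. It assumes that an error is always a number in parentheses, as in the sample above, and that fields are separated by whitespace; the function name and the dictionary layout are our own choices for illustration.

```python
import re

# Assumption: error responses look like "(3)", a number in parentheses.
ERROR_CODE = re.compile(r"^\((\d+)\)$")


def parse_rdns_line(line):
    """Return a dict with either a hostname or a numeric error code."""
    ip, ts, response = line.strip().split(None, 2)
    record = {"ip": ip, "timestamp": int(ts), "hostname": None, "error": None}
    match = ERROR_CODE.match(response)
    if match:
        record["error"] = int(match.group(1))
    else:
        # Normalize case and drop a trailing dot, if any.
        record["hostname"] = response.rstrip(".").lower()
    return record


rdns_sample = [
    "10.0.0.1 1338933900 (3)",
    "8.8.8.8 1346766300 google-public-dns-a.google.com",
]

for entry in rdns_sample:
    print(parse_rdns_line(entry))
```

Keeping the timestamp with each hostname matters here, since the mapping between a domain name and an IP address can change between scans.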

Service Probe

Service probe scans look for responses on various ports that can be attributed to certain services. The probed IP address returns a state of 1 (open), 2 (open filtered), 3 (reset), 4 (closed/reset), or 5 (timeout). Although 175 billion probes were saved, only a small minority of the data has a return state of 1. For IP addresses returning a state of 2-5, the file lists only the number and no further explanation. If the IP address returns a state of 1, then the response to the probe is recorded and, in many cases, it is in an encoded or unreadable format. The data also includes a Unix format timestamp. Below are examples of IP address listings in an internet census service probes file.

IP Address Timestamp State Result
10.0.0.1 1338933900 5
10.0.0.2 1346766300 1 0=84=00=00=00=10=02=01=01a=84=00=00=00=07=0A=01=00=04=00=04=00

An NSE scan normally handles service probes by using the responses returned with a state of 1 to determine the most likely service running on that port. If this determination was made as part of the service probe scans from the Carna botnet, the results were not released as part of the data set.

Within our short time frame, we did not come up with a way to use the data provided to make the determination on our own. For the very small minority of cases where a response returns readable information, there may be some important data, but the analyst must go through the responses manually because the format is not standard, and the content varies widely. We have not developed an automated way to go through this part of the data set, although using a text search tool like grep with regular expressions may enable the identification of entries that may contain things like phone numbers or email addresses.
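
The search idea can also be sketched in Python rather than grep. The sample response above resembles quoted-printable text (bytes written as `=XX` hex escapes), so this sketch assumes that encoding and decodes it with the standard `quopri` module before searching. The email pattern is deliberately loose, and the function names and the third sample record are hypothetical.

```python
import quopri
import re

# A loose pattern for things that look like email addresses.
EMAIL = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse_service_probe(line):
    """Return (ip, timestamp, state, decoded response bytes).

    Records with a state of 2-5 carry no response, so the
    decoded bytes are empty for them.
    """
    parts = line.strip().split(None, 3)
    ip, ts, state = parts[0], int(parts[1]), int(parts[2])
    # Assumption: the response field is quoted-printable encoded.
    raw = quopri.decodestring(parts[3].encode("ascii")) if len(parts) > 3 else b""
    return ip, ts, state, raw


def find_emails(raw):
    """Return email-like strings found in the decoded response."""
    return [m.decode("ascii") for m in EMAIL.findall(raw)]


probe_sample = [
    "10.0.0.1 1338933900 5",
    "10.0.0.2 1346766300 1 0=84=00=00=00=10=02=01=01a=84=00=00=00=07=0A=01=00=04=00=04=00",
    # Hypothetical readable response, not taken from the data set:
    "10.0.0.3 1346766300 1 220 mail ready, contact admin@example.com=0D=0A",
]

for entry in probe_sample:
    ip, ts, state, raw = parse_service_probe(entry)
    if state == 1:
        print(ip, len(raw), find_emails(raw))
```

Anything found this way still needs manual review, in keeping with the point above that the readable responses do not follow a standard format.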

Host Probe

Host probe scans relay to a researcher whether the IP address is up or down and the reason for each state. This scanning was done as a first step of the ICMP ping scans to determine if a more in-depth scan should be tried against the IP address. The host probe data provided in the internet census is divided by CIDR /8 netblock; each /8 netblock contains its own unique file. The host probe data contains the target IP address, a Unix format timestamp for when the data was received, an indication of whether the device tied to the IP address is up or down, and the reason for the determination. Below are examples of IP address listings in an internet census host probe file.

IP Address Timestamp State Reason
10.0.0.1 1338933900 up reset
10.0.0.2 1346766300 down no-response
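
Since an address can appear more than once, one simple way to summarize host probe records is to keep only the most recent state per address. The sketch below assumes the four whitespace-separated fields shown above; the function name and the third sample record are hypothetical.

```python
def latest_state(lines):
    """Map each IP address to its most recent (timestamp, state, reason)."""
    latest = {}
    for line in lines:
        if not line.strip():
            continue
        ip, ts, state, reason = line.strip().split(None, 3)
        ts = int(ts)
        # Keep the record with the largest Unix timestamp.
        if ip not in latest or ts > latest[ip][0]:
            latest[ip] = (ts, state, reason)
    return latest


host_sample = [
    "10.0.0.1 1338933900 up reset",
    "10.0.0.2 1346766300 down no-response",
    "10.0.0.1 1346766300 down no-response",  # hypothetical later record
]

for ip, record in sorted(latest_state(host_sample).items()):
    print(ip, record)
```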

Syncscan

The syncscan queries inform a researcher if a particular IP address has open, open-filtered, or closed ports. This scan contains every IP address in the IPv4 address space. The syncscan data provided in the internet census is divided by CIDR /8 netblock; each /8 netblock contains its own unique file. The data lists all of the open, open-filtered, and closed ports associated with each IP address in the /8 netblock. In many cases, an IP address has all three statuses, occurring at various times, and therefore has three different lines in the file. Each record also provides a Unix format timestamp for when the data was received, the reason for the reported status (for example, syn-ack or reset), and whether the ports were probed over TCP or UDP. Below are examples of IP address listings in an internet census syncscan file.

IP Address Timestamp Status Reason TCP/UDP Ports
10.0.0.1 1334544300 open syn-ack tcp 1723,3389
10.0.0.1 1335746700 closed reset tcp 20,21,22,23,25,53,80,110,111,143,443,993,995,3306,5900,8080
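
Because one address can have separate lines for each status, it can help to merge them into a single view per address. This is a minimal sketch that assumes every record has all six whitespace-separated fields and a comma-separated port list; the function names are hypothetical.

```python
from collections import defaultdict


def parse_syncscan_line(line):
    """Return (ip, timestamp, status, reason, protocol, list of ports)."""
    ip, ts, status, reason, proto, ports = line.strip().split(None, 5)
    # Tolerate stray spaces inside the port list.
    port_list = [int(p) for p in ports.replace(" ", "").split(",") if p]
    return ip, int(ts), status, reason, proto, port_list


def ports_by_status(lines):
    """Build {ip: {status: set of ports}} across all records."""
    table = defaultdict(lambda: defaultdict(set))
    for line in lines:
        if line.strip():
            ip, _, status, _, _, ports = parse_syncscan_line(line)
            table[ip][status].update(ports)
    return table


sync_sample = [
    "10.0.0.1 1334544300 open syn-ack tcp 1723,3389",
    "10.0.0.1 1335746700 closed reset tcp 20,21,22,23,25,53,80,110,111,143,443,993,995,3306,5900,8080",
]

summary = ports_by_status(sync_sample)
print(sorted(summary["10.0.0.1"]["open"]))
```

Note that this merge discards the timestamps, so it cannot show that a port was open at one time and closed at another.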

TCP/IP Fingerprint

This scan attempts to gather information that identifies the type of device and the operating system (OS) running on the machine. The data includes the target IP address, a Unix format timestamp, and the fingerprint of the target device. Below are examples of IP address listings in an Internet Census TCP/IP fingerprint file.

IP Address Timestamp Fingerprint
10.0.0.1 1340018100 SCAN(V=5.51%D=6/18%OT=443%CT=21%CU=33630%PV=N%DS=21%DC=I%G=N%TM=4FDF0ED7%P=mips-openwrt-linux-gnu),SEQ(SP=104%GCD=1%ISR=10B%TI=I%CI=I%II=I%SS=S%TS=U),OPS(O1=M5B4%O2=M5B4%O3=M5B4%O4=M5B4%O5=M5B4%O6=M5B4),WIN(W1=2DA0%W2=2DA0%W3=2DA0%W4=2DA0%W5=2DA0%W6=2DA0),ECN(R=Y%DF=N%T=3C%W=3%O=%CC=N%Q=U),T1(R=Y%DF=N%T=3C%S=O%A=S+%F=AS%RD=0%Q=),T2(R=Y%DF=N%T=3C%W=80%S=A%A=Z%F=R%O=%RD=0%Q=),T3(R=Y%DF=N%T=3C%W=100%S=A%A=Z%F=R%O=%RD=0%Q=),T4(R=Y%DF=N%T=3C%W=400%S=A%A=Z%F=R%O=%RD=0%Q=),T5(R=Y%DF=N%T=3C%W=7A69%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=Y%DF=N%T=3C%W=8000%S=A%A=S%F=AR%O=%RD=0%Q=),T7(R=Y%DF=N%T=3C%W=FFFF%S=Z%A=S+%F=AR%O=%RD=0%Q=),U1(R=Y%DF=N%T=3C%IPL=38%UN=0%RIPL=G%RID=G%RIPCK=G%RUCK=G%RUD=G),IE(R=Y%DFI=N%T=3C%CD=S)
10.0.0.2 1346696100 SCAN(V=6.01%E=4%D=9/3%OT=443%CT=21%CU=44254%PV=N%DS=13%DC=I%G=N%TM=5044F25F%P=mipsel-openwrt-linux-gnu),SEQ(SP=107%GCD=1%ISR=107%TI=I%CI=I%TS=U),SEQ(SP=102%GCD=1%ISR=107%TI=I%CI=I%II=I%SS=S%TS=U),OPS(O1=M5B4%O2=M5B4%O3=M5B4%O4=M5B4%O5=M5B4%O6=M5B4),WIN(W1=2DA0%W2=2DA0%W3=2DA0%W4=2DA0%W5=2DA0%W6=2DA0),ECN(R=Y%DF=N%T=41%W=3%O=%CC=N%Q=U),T1(R=Y%DF=N%T=41%S=O%A=S+%F=AS%RD=0%Q=),T2(R=Y%DF=N%T=41%W=80%S=A%A=Z%F=R%O=%RD=0%Q=),T3(R=Y%DF=N%T=41%W=100%S=A%A=Z%F=R%O=%RD=0%Q=),T4(R=Y%DF=N%T=1%W=400%S=A%A=Z%F=R%O=%RD=0%Q=),T5(R=Y%DF=N%T=41%W=7A69%S=Z%A=S+%F=AR%O=%RD=0%Q=),T6(R=Y%DF=N%T=41%W=8000%S=A%A=S%F=AR%O=%RD=0%Q=),T7(R=Y%DF=N%T=41%W=FFFF%S=Z%A=S+%F=AR%O=%RD=0%Q=),U1(R=Y%DF=N%T=41%IPL=38%UN=0%RIPL=G%RID=G%RIPCK=G%RUCK=G%RUD=G),IE(R=Y%DFI=N%T=41%CD=S)

It appears the Nmap tool used in the scan was modified to store the response used in the determination of the fingerprint. The standard Nmap program captures the operating system results but does not keep the actual response used in fingerprinting. If the operating system results were captured during the internet census scans, they were not provided in the internet census data.

It is possible to write a program that compares the stored fingerprints with the Nmap nmap-os-db file to get an idea of what type of device may be tied to an IP address. However, not all the fingerprints have a high percentage of fields that match any template in the Nmap file. Also, a few of the stored fingerprints are malformed, not matching the standard format, and some also have random extraneous data embedded.
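
To give a sense of what such a comparison program involves, here is a simplified sketch that parses a stored fingerprint and computes the fraction of template fields it matches. This is not how Nmap itself scores matches: Nmap weights fields with match points and supports more operators than the alternation (`|`) and hexadecimal ranges handled here. The function names, the shortened fingerprint, and the template values are all hypothetical.

```python
import re

# A test looks like NAME(key=value%key=value...).
TEST = re.compile(r"([A-Z][A-Z0-9]*)\(([^)]*)\)")


def parse_fingerprint(text):
    """Return a list of (test name, {field: value}) pairs.

    A list is used because a test such as SEQ can appear more than once.
    """
    tests = []
    for name, body in TEST.findall(text.replace(" ", "")):
        fields = {}
        for item in body.split("%"):
            if item:
                key, _, value = item.partition("=")
                fields[key] = value
        tests.append((name, fields))
    return tests


def value_matches(observed, expr):
    """Check a value against alternatives (a|b) and hex ranges (lo-hi)."""
    for alt in expr.split("|"):
        if observed == alt:
            return True
        lo, sep, hi = alt.partition("-")
        if sep:
            try:
                if int(lo, 16) <= int(observed, 16) <= int(hi, 16):
                    return True
            except ValueError:
                pass  # not a hexadecimal range
    return False


def match_fraction(observed, template):
    """Unweighted share of template fields matched by the observation."""
    obs = {}
    for name, fields in observed:
        obs.setdefault(name, fields)  # keep the first of any repeated test
    total = hits = 0
    for name, fields in template:
        for key, expr in fields.items():
            total += 1
            if key in obs.get(name, {}) and value_matches(obs[name][key], expr):
                hits += 1
    return hits / total if total else 0.0


observed = parse_fingerprint(
    "SCAN(V=5.51%OT=443%CT=21),SEQ(SP=104%GCD=1),ECN(R=Y%DF=N%T=3C%W=3%O=%CC=N%Q=U)"
)
template = [  # hypothetical template, not copied from the Nmap database
    ("SEQ", {"SP": "FF-10E", "GCD": "1-6"}),
    ("ECN", {"R": "Y", "DF": "Y", "T": "3B-45"}),
]
print(match_fraction(observed, template))
```

A low fraction against every template is one way to flag the fingerprints that do not closely match any known device, as well as the malformed ones.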

Traceroute

The traceroute scan provides the path a data packet took from a source address to the destination. Unlike the other scans, all traceroute data is stored in a single file. This makes sense because this data set has the fewest entries. Each entry contains multiple IP addresses, usually from multiple CIDR /8 netblocks. The data includes a Unix format timestamp, the IP address of the device initiating the traceroute scan, the destination IP address, an indication of whether the scan was done using ICMP or UDP, and the route results. Below are examples of IP address listings in the traceroute file.

Timestamp Source IP Address Destination IP Address ICMP/UDP Result
1340158500 10.0.0.1 10.0.0.2 icmp 1:10.39.155.45:40ms,40ms,40ms;2::*,*,*;3::*,*,*;
1340158500 10.0.0.2 10.1.1.1 udp 1:10.0.0.130:30ms,30ms,30ms;2:10.164.28.78:20ms,30ms,30ms;3:10.223.44.229:30ms,30ms,30ms;

Most of the traceroutes were not able to capture a full route. Many capture only the first hop, often a 10/8 address, and many others just seem to peter out after several hops. There are relatively few that appear to be complete. It may be interesting to map out the complete traceroutes, but we have not yet found a reason to use this data in our analysis.
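
One way to pick out the complete routes is to parse the hop list and check whether any responding hop is the destination itself. The sketch below assumes the `hop:address:time,time,time;` layout shown in the examples and treats a route as complete only if the destination address appears as a hop; both the function names and that definition of "complete" are our assumptions.

```python
def parse_traceroute_line(line):
    """Return (timestamp, source, destination, protocol, hops).

    Each hop is (hop number, address or None, list of round-trip
    times in milliseconds, with None for unanswered probes).
    """
    ts, src, dst, proto, result = line.strip().split(None, 4)
    hops = []
    for hop in result.replace(" ", "").split(";"):
        if not hop:
            continue
        number, ip, times = hop.split(":", 2)
        rtts = [None if t == "*" else float(t.rstrip("ms")) for t in times.split(",")]
        hops.append((int(number), ip or None, rtts))
    return int(ts), src, dst, proto, hops


def reached_destination(hops, dst):
    """True if the destination address shows up as one of the hops."""
    return any(ip == dst for _, ip, _ in hops)


trace_sample = [
    "1340158500 10.0.0.1 10.0.0.2 icmp 1:10.39.155.45:40ms,40ms,40ms;2::*,*,*;3::*,*,*;",
    "1340158500 10.0.0.2 10.1.1.1 udp 1:10.0.0.130:30ms,30ms,30ms;2:10.164.28.78:20ms,30ms,30ms;3:10.223.44.229:30ms,30ms,30ms;",
]

for entry in trace_sample:
    ts, src, dst, proto, hops = parse_traceroute_line(entry)
    print(src, dst, len(hops), reached_destination(hops, dst))
```

Neither sample route reaches its destination, which is consistent with the observation that most of the stored traceroutes are partial.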

Recap

The Internet Census 2012 data set is large and varied. Several of the scan types may be interesting but either require extensive manual sifting or duplicate data available from other scans.

  • Service probes did not store the most useful part of the data: the service running on a port.
  • Host probe data is redundant with the ICMP ping data.
  • Traceroutes have relatively few complete routes between different IP addresses; compared to other scans, relatively few IP addresses are represented.

The other scans are easier to use with automated analysis or provide significant data.

  • ICMP ping data can be used to quickly find IP addresses that were alive on at least one date.
  • Reverse DNS can be used to get an idea of IP addresses associated with specific domains or vice versa.
  • Syncscans provide an interesting set of IP addresses that have open ports.
  • TCP/IP fingerprints can be used to determine likely operating systems or devices tied to an IP address.
