
Whitebox Monitoring with Prometheus

Joe Yankel

In the ever-changing world of DevOps, where micro-services and distributed architectures are becoming the norm, the need to understand application internal state is growing rapidly. Whitebox monitoring gives you details about the internal state of your application, such as the total number of HTTP requests on your web server or the number of errors logged. In contrast, blackbox monitoring (e.g., Nagios) checks a system or application from the outside (e.g., checking disk space or pinging a host) to see whether a host or service is alive, but it does not help you understand how the system got to its current state. Prometheus is an open source whitebox monitoring solution that uses a time-series database to provide scraping, querying, graphing, and alerting based on time-series data. This blog post briefly explores the benefits of using Prometheus as a whitebox monitoring tool.

Many of the organizations we work with use Nagios as their only monitoring tool for systems and applications. While it works well for its intended use, it does not provide complete details. Recently, in our test environment, Nagios alerted me that the disk was 90% full. The disk space check worked as expected, but it did not indicate why the disk was filling up. More importantly, it didn't indicate how much time I had until the system was no longer usable. Whitebox monitoring would allow me to see the rate at which the disk was filling, and this critical information could result in a good night's sleep for the system administrator if the trend implies the disk will remain usable for the next 12 hours. The problem causing my application's disk to fill up was excessive logging caused by a lengthy database query. This discovery led to more discoveries, and I eventually narrowed the problem to two services we had running on separate machines.
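To make the "how much time do I have" question concrete, here is a minimal sketch of the arithmetic involved: fit a straight line through recent disk-usage samples and extrapolate it to the disk's capacity. Prometheus's query language offers a predict_linear function for this kind of extrapolation; the code below is not that implementation, just an illustration of the idea. The function name hours_until_full and the sample readings are hypothetical and were not part of the incident described above.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(
    samples: Sequence[Tuple[float, float]], capacity_bytes: float
) -> Optional[float]:
    """Estimate hours until the disk is full.

    samples holds (unix_seconds, used_bytes) pairs. A least-squares line is
    fitted through them and extrapolated to capacity_bytes. Returns None when
    usage is flat or shrinking, since the disk would never fill at that rate.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a rate")

    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n

    # Least-squares slope: bytes written per second.
    covariance = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    rate = covariance / variance
    if rate <= 0:
        return None

    # Value of the fitted line at the most recent timestamp.
    latest_t = max(t for t, _ in samples)
    fitted_now = mean_u + rate * (latest_t - mean_t)
    remaining = max(capacity_bytes - fitted_now, 0.0)
    return remaining / rate / 3600.0


# Hypothetical readings: a 100 GB disk at 90 GB, growing 0.5 GB every 30 minutes.
GB = 1024 ** 3
readings = [(i * 1800, (90 + 0.5 * i) * GB) for i in range(5)]
estimate = hours_until_full(readings, capacity_bytes=100 * GB)
print(f"Estimated hours until full: {estimate:.1f}")
```

With these made-up readings the disk is growing by about 1 GB per hour with 8 GB left, so the estimate comes out at roughly eight hours. A real deployment would lean on the monitoring system's own query functions rather than hand-rolled code like this.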

Had we been using Prometheus to monitor several different metrics of our micro-services, we could have easily tracked down problems as we began to increase load and testing of our applications. It is simple to implement HTTP endpoints in our code for Prometheus to scrape. The metrics scraped from these endpoints are stored in the Prometheus database for later analysis, visualization, and alerting. Official client libraries already exist for many languages, including Go, Java, Scala, Python, and Ruby, and there are many more unofficial third-party libraries. The metrics you expose must be in a format that Prometheus can understand, but Prometheus maintains a number of exporters that export existing metrics from third-party systems, such as Linux system stats or Memcached. These exporters include the following and more:

  • Node/system metrics
  • Blackbox
  • Collectd
  • Consul
  • HAProxy
  • SNMP

There are also a large number of third-party exporters that can be found to convert existing system metrics into Prometheus metrics, thereby avoiding the need to instrument a given system with metrics directly.
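To give a feel for what "an HTTP endpoint for Prometheus to scrape" involves, the following sketch exposes a single counter at /metrics in the Prometheus text exposition format using only the Python standard library. It is a hand-rolled illustration, not the official Python client library, which is the better choice in practice. The Counter class, the metric name app_http_requests_total, the method label, and port 8000 are all assumptions made for this example, and label values are not escaped here as a production implementation would need to do.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple


class Counter:
    """A thread-safe, increase-only counter keyed by label values (hypothetical)."""

    def __init__(self, name: str, help_text: str, label_names: Tuple[str, ...] = ()):
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self._values: Dict[Tuple[str, ...], float] = {}
        self._lock = threading.Lock()

    def inc(self, *label_values: str, amount: float = 1.0) -> None:
        if len(label_values) != len(self.label_names):
            raise ValueError("label values do not match label names")
        with self._lock:
            self._values[label_values] = self._values.get(label_values, 0.0) + amount

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for label_values, value in sorted(self._values.items()):
                pairs = ",".join(
                    f'{k}="{v}"' for k, v in zip(self.label_names, label_values)
                )
                labels = f"{{{pairs}}}" if pairs else ""
                lines.append(f"{self.name}{labels} {value}")
        return "\n".join(lines) + "\n"


# Assumed metric for illustration: requests handled, split by HTTP method.
REQUESTS = Counter("app_http_requests_total", "Total HTTP requests handled.", ("method",))


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values at /metrics and 404s everything else."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUESTS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrape requests on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Simulate a little traffic and show what a scrape would return.
REQUESTS.inc("GET")
REQUESTS.inc("GET")
REQUESTS.inc("POST")
print(REQUESTS.render(), end="")
```

Calling serve() would start the endpoint; a Prometheus server configured to scrape that host and port would then collect the counter on each scrape interval. The exporters mentioned above do essentially the same job for systems you cannot instrument directly.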

We plan on including Prometheus metrics when developing applications in the future because they add value during the software development lifecycle. How easy is it to begin? I wanted to answer that question, so I decided to set up Prometheus and monitor an existing Django application. The simplest way to run Prometheus is to use its official Docker container, which can be found at https://hub.docker.com/u/prom. Within a few minutes, I was running Prometheus, which by default collects data about itself to provide a working example of the system. There are Python client libraries available, but a Django exporter named django-prometheus already exists. This exporter is a module that provides Django monitoring metrics to Prometheus, and I quickly installed it to see some results. After a few lines of configuration, I was able to view metrics such as the total number of HTTP Ajax requests and the total number of HTTP requests by view, transport, and method. In less than 30 minutes, I was looking at metrics from our web application that I had little insight into before. Now I can investigate why so many Ajax requests are being made to a specific method, which I didn't expect.
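For readers wondering what "a few lines of configuration" might look like, here is a sketch of a Django settings fragment based on the django-prometheus README as I understand it. It is not necessarily the configuration used for the application described above, and older Django versions use MIDDLEWARE_CLASSES instead of MIDDLEWARE, so check it against the versions you have installed. The other apps and middleware listed are placeholders standing in for whatever your project already uses.

```python
# settings.py (fragment): assumed layout, adapt to your own project.

INSTALLED_APPS = [
    "django.contrib.contenttypes",  # placeholder for your existing apps
    "django.contrib.auth",          # placeholder for your existing apps
    "django_prometheus",            # enables the exporter
]

MIDDLEWARE = [
    # Must come first so it can time the whole request.
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    "django.middleware.security.SecurityMiddleware",  # placeholder
    "django.middleware.common.CommonMiddleware",      # placeholder
    # Must come last so it sees the final response.
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# urls.py then needs one extra route so Prometheus has something to scrape:
#
#     from django.urls import include, path
#     urlpatterns = [
#         # ... your existing routes ...
#         path("", include("django_prometheus.urls")),
#     ]
```

With the route in place, the application serves its metrics at /metrics, and Prometheus can be pointed at that path as a scrape target.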

Prometheus is a great option for an open source whitebox monitoring solution for any organization. It is easy to set up and use, and it already has many third-party tools and a user base available to help get you started. We are planning to implement Prometheus metrics into many of our micro-services and applications in the future. If you are interested in using Prometheus, please contact us and we can provide more details on our experiences.

Additional Resources

To view the webinar DevOps Panel Discussion featuring Kevin Fall, Hasan Yasar, and Joseph D. Yankel, please click here.

To view the webinar Culture Shock: Unlocking DevOps with Collaboration and Communication with Aaron Volkmann and Todd Waits, please click here.

To view the webinar What DevOps is Not! with Hasan Yasar and C. Aaron Cois, please click here.

To listen to the podcast DevOps--Transform Development and Operations for Fast, Secure Deployments featuring Gene Kim and Julia Allen, please click here.

To read all of the blog posts in our DevOps series, please click here.
