
Collecting Data, The DevOps Way

Kiriakos Kontostathis

Our team was recently tasked with preparing a very large data set (~50 TB) for analysis with machine learning and causal learning. We had to overcome three obstacles with this data:

  • Sensitive information was laced throughout the data, and we had to remove it before doing any analysis.
  • The data set continues to grow, so we needed to be able to repeat the process.
  • We needed to transform the data from unreadable binary data into human-readable, CSV-formatted data. We also had to enrich the data set by calculating derived fields and including them. This two-part transformation allowed us to more easily leverage learning tools and algorithms.
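To make the scrubbing and two-part transformation concrete, here is a minimal Python sketch. It is not our actual code: the binary record layout (a 16-byte name plus two 32-bit timestamps), the column names, and the `scrub_and_transform` function are all assumptions made up for illustration. It drops the sensitive field, computes one derived field, and writes CSV.

```python
import csv
import io
import struct

# Hypothetical record layout (an assumption, not the real data format):
# a 16-byte name, an unsigned 32-bit start time, an unsigned 32-bit end time.
RECORD = struct.Struct("<16sII")


def scrub_and_transform(raw: bytes) -> str:
    """Turn packed binary records into CSV text without the sensitive field."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["start", "end", "duration"])
    for _name, start, end in RECORD.iter_unpack(raw):
        # The sensitive field (_name) is never written out.
        # "duration" is a derived field computed from the two timestamps.
        writer.writerow([start, end, end - start])
    return out.getvalue()


# Two made-up records standing in for the real binary input.
sample = RECORD.pack(b"alice", 100, 160) + RECORD.pack(b"bob", 200, 230)
csv_text = scrub_and_transform(sample)
print(csv_text)
```

Because the function takes bytes in and returns text out, it can be rerun unchanged whenever new data arrives, which is the repeatability the second obstacle calls for.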

Essentially, we needed a repeatable, automated solution that was not prone to human error. Naturally, we turned to DevOps to help create one.

We started by researching best practices in the community and found very few examples of this type of work in practice. We drew from the few available case studies and started to piece our solution together. We decided to build a pipeline for the data to travel through, one that would include all of the necessary scrubbing and transformation scripts.

At the end of this pipeline, the data lands in a PostgreSQL database that can be queried through a custom web application built on the Django web framework. A user can then download the query results as a CSV file and upload that file into a machine-learning tool. All of the necessary scripts were built using a variety of technologies, primarily Bash and Python. Each script uses the output of the previous script as its input, so no human interaction is needed. Ultimately, we devised a very simple solution, but it provided exactly what was needed: repeatability, automation, and very little human interaction.
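The chaining idea, where each script consumes the previous script's output, can be sketched in a few lines of Python. This is only an illustration under assumed names: the `scrub` and `derive` stages, the `ssn` column, and the `run_pipeline` helper are hypothetical and stand in for our real Bash and Python scripts.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Stage = Callable[[List[Row]], List[Row]]


def scrub(rows: List[Row]) -> List[Row]:
    """Drop a hypothetical sensitive column from every row."""
    return [{k: v for k, v in row.items() if k != "ssn"} for row in rows]


def derive(rows: List[Row]) -> List[Row]:
    """Add a hypothetical derived field computed from existing fields."""
    return [{**row, "total": row["a"] + row["b"]} for row in rows]


def run_pipeline(rows: List[Row], stages: List[Stage]) -> List[Row]:
    """Feed the output of each stage into the next, with no manual steps."""
    for stage in stages:
        rows = stage(rows)
    return rows


# One made-up input row; the stage order is fixed in code, not by a person.
result = run_pipeline([{"ssn": "000-00-0000", "a": 1, "b": 2}], [scrub, derive])
print(result)
```

Keeping the stage order in a single list means the whole run can be repeated or extended by editing one line, rather than by someone remembering which script to run next.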

A DevOps Solution to a Problem Not Typically Solved with DevOps

We are in the process of incorporating more DevOps principles into data collection. Major planned updates include deploying this infrastructure to a cloud environment and building a mature documentation process that will improve the traceability of the data. We learned that it is important to reach for a DevOps solution even if you fall short in the beginning. DevOps gives you the flexibility needed to iterate and continuously improve a solution throughout the project lifecycle.

Additional Resources

To view the webinar DevOps Panel Discussion featuring Kevin Fall, Hasan Yasar, and Joseph D. Yankel, please click here.

To view the webinar Culture Shock: Unlocking DevOps with Collaboration and Communication with Aaron Volkmann and Todd Waits, please click here.

To view the webinar What DevOps is Not! with Hasan Yasar and C. Aaron Cois, please click here.

To listen to the podcast DevOps--Transform Development and Operations for Fast, Secure Deployments featuring Gene Kim and Julia Allen, please click here.
