Collecting Data, The DevOps Way

Our team was recently tasked with working on a very large dataset (~50 TB) containing information that we wanted to analyze using machine and causal learning. We had to overcome three obstacles with this data:

The data was laced with sensitive information that we had to remove before doing any analysis.
The dataset continues to grow, so we needed to be able to repeat the process.
We needed to transform the data from unreadable binary into human-readable, CSV-formatted data. We also had to augment the dataset by calculating derived fields and including them. This two-part transformation allowed us to more easily leverage learning tools and algorithms.
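The two-part transformation in the third obstacle can be illustrated with a minimal sketch. The actual binary format and derived fields are not described here, so the record layout (`RECORD`), the column names, and the `speed_mps` derived field below are all hypothetical placeholders.

```python
import csv
import io
import struct

# Hypothetical record layout (an assumption, not the real format):
# little-endian uint32 id, float64 distance in metres, float64 duration in seconds.
RECORD = struct.Struct("<Idd")

FIELDS = ["id", "distance_m", "duration_s", "speed_mps"]


def decode_records(blob):
    """Yield one dict per fixed-size binary record in blob."""
    for record_id, distance_m, duration_s in RECORD.iter_unpack(blob):
        yield {"id": record_id, "distance_m": distance_m, "duration_s": duration_s}


def add_derived_fields(row):
    """Return a copy of row with a derived average-speed column (illustrative only)."""
    speed = row["distance_m"] / row["duration_s"] if row["duration_s"] else 0.0
    return {**row, "speed_mps": round(speed, 3)}


def to_csv(blob):
    """Part one: decode binary to rows. Part two: add derived fields. Then emit CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in decode_records(blob):
        writer.writerow(add_derived_fields(row))
    return out.getvalue()
```

Keeping decoding and derivation in separate functions means either step can be rerun or replaced on its own as the dataset grows.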

Essentially, we needed to come up with a solution that was not prone to human error and was repeatable and automated. Naturally, we turned to DevOps to help create a solution.

We began by researching best practices in the community and found very few examples of this type of work in practice. Drawing on the few available case studies, we started to piece our solution together. We decided to build a pipeline for the data to travel through, one that would include all of the necessary scrubbing and transforming scripts.
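As a rough illustration of what a scrubbing script in such a pipeline might look like, here is a minimal sketch. The kinds of sensitive information in the dataset are not stated, so the email and US Social Security number patterns below are placeholder assumptions, as are the names `PATTERNS` and `scrub`.

```python
import re

# Placeholder patterns (assumptions): the real sensitive fields are not described.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scrub(text):
    """Replace every match of each pattern with a labeled redaction token."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label}]", text)
    return text
```

Because the rules live in one table, a reviewer can audit what is removed without reading the rest of the pipeline, which helps keep the process repeatable and less prone to human error.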

At the end of this pipeline, the data lands in a PostgreSQL database that can be queried through a custom web application built on the Django web framework. A user then downloads the results of a query as a CSV file and uploads that file into a machine-learning tool. The necessary scripts were written primarily in Bash and Python. Each script uses the output of the previous script as its input, so no human interaction is needed. Ultimately, we devised a very simple solution, but it provided exactly what was needed: repeatability, automation, and very little human interaction.
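The idea of feeding each script's output into the next can be sketched in a single Python process. This is an in-process analogue under stated assumptions, not the actual Bash and Python scripts; the stage names (`drop_sensitive`, `add_total`) and field names are hypothetical.

```python
from functools import reduce


def run_pipeline(records, stages):
    """Pass records through each stage in order; each stage's output feeds the next."""
    return reduce(lambda stream, stage: stage(stream), stages, records)


def drop_sensitive(records):
    """Hypothetical scrubbing stage: remove the 'name' field from every record."""
    for record in records:
        yield {key: value for key, value in record.items() if key != "name"}


def add_total(records):
    """Hypothetical transforming stage: add a derived 'total' field."""
    for record in records:
        yield {**record, "total": record["a"] + record["b"]}


if __name__ == "__main__":
    raw = [{"name": "x", "a": 1, "b": 2}]
    for row in run_pipeline(raw, [drop_sensitive, add_total]):
        print(row)
```

Because the stages are generators, records stream through one at a time instead of being held in memory all at once, which matters for a dataset of this size.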

A DevOps Solution to a Problem Not Typically Solved with DevOps

We are in the process of incorporating more DevOps principles into our data collection. Major updates will include deploying this infrastructure to a cloud environment and building a mature documentation process that will improve the traceability of the data. We learned that it is important to reach for a DevOps solution even if you fall short in the beginning. DevOps gives you the flexibility needed to iterate and continuously improve a solution throughout the project lifecycle.

Additional Resources

Webinar: DevOps Panel Discussion, featuring Kevin Fall, Hasan Yasar, and Joseph D. Yankel.

Webinar: Culture Shock: Unlocking DevOps with Collaboration and Communication, with Aaron Volkmann and Todd Waits.

Webinar: What DevOps is Not!, with Hasan Yasar and C. Aaron Cois.

Podcast: DevOps--Transform Development and Operations for Fast, Secure Deployments, featuring Gene Kim and Julia Allen.

Software Engineering Institute

SEI Blog

Kostas Kontogiannis

November 21, 2017
