Collecting Data, The DevOps Way
Data collection and storage are a large part of almost every software project, yet the topic is rarely discussed in the DevOps community. The adoption rate of database continuous delivery (CD) is about half the rate of application CD. There are several reasons for this, but the primary one is that databases rarely change as often as applications do. There may be a few model changes, but major architectural changes at the database level are uncommon. As a result, many DevOps practitioners do not invest the time to provide continuous delivery for their data storage solutions. That gap became very apparent when our team was recently tasked with solving a complex data problem. In this blog post, I will explore the application of DevOps principles to a data science project.
Our team was recently given a very large dataset (~50 TB) containing data that we wanted to analyze using machine learning and causal learning. We had to overcome three obstacles with this data:
- There was sensitive information laced throughout the data that we had to remove before we did any analysis.
- The dataset continues to grow, so we needed to be able to repeat the process as new data arrives.
- We needed to transform the data from an unreadable binary format into human-readable, CSV-formatted data. We also had to enrich the dataset by calculating derived fields and adding them to it. This two-part transformation allowed us to apply learning tools and algorithms more easily.
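To make the three obstacles concrete, here is a minimal Python sketch of one pass over the data: decode binary records, drop a sensitive field, add a derived field, and emit CSV. The record layout, the field names (`user`, `start`, `end`, `duration`), and the helper functions are all hypothetical; they are assumptions for illustration, not the actual format or scripts.

```python
import csv
import io
import struct

# Hypothetical fixed-width record layout (an assumption, not the real format):
# 16-byte user name, unsigned 32-bit start time, unsigned 32-bit end time.
RECORD = struct.Struct("<16sII")


def decode_records(blob):
    """Yield one dict per fixed-width binary record in blob."""
    for name, start, end in RECORD.iter_unpack(blob):
        yield {
            "user": name.rstrip(b"\x00").decode("ascii"),
            "start": start,
            "end": end,
        }


def scrub(record):
    """Drop the sensitive field (here, the hypothetical 'user' column)."""
    return {key: value for key, value in record.items() if key != "user"}


def add_derived(record):
    """Add a derived field computed from the existing fields."""
    return {**record, "duration": record["end"] - record["start"]}


def to_csv(blob):
    """Decode, scrub, and enrich binary records, returning CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["start", "end", "duration"], lineterminator="\n"
    )
    writer.writeheader()
    for record in decode_records(blob):
        writer.writerow(add_derived(scrub(record)))
    return out.getvalue()


# Two made-up records, packed the same way the decoder expects them.
sample = RECORD.pack(b"alice", 100, 160) + RECORD.pack(b"bob", 200, 230)
print(to_csv(sample))
```

Because every step is a plain function over records, the same code can be rerun unchanged whenever the dataset grows.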
Essentially, we needed to come up with a solution that was not prone to human error and was repeatable and automated. Naturally, we turned to DevOps to help create a solution.
We researched best practices in the community and found very few examples of this type of work in practice. We drew from the few available case studies and started to piece our solution together. We decided to build a pipeline that the data would travel through, with all of the necessary scrubbing and transforming scripts included.
At the end of this pipeline, the data lands in a PostgreSQL database that can be queried through a custom web application built on the Django web framework. A user can then download the results of a query as a CSV file and upload that same file into a machine-learning tool. All of the necessary scripts were built using a variety of technologies, primarily Bash and Python. Each script takes the output of the previous script as its input, so no human interaction is needed between steps. Ultimately, we devised a very simple solution, but it provided exactly what was needed: repeatability, automation, and very little human interaction.
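The chaining idea, where one step's output becomes the next step's input, can be sketched in a few lines of Python. This is only an illustration: the stage functions below are hypothetical stand-ins that run in one process, whereas the scripts described above were separate Bash and Python programs.

```python
from functools import reduce


def run_pipeline(data, stages):
    """Feed data through each stage in order.

    Each stage's return value is passed as the argument to the next stage,
    so no manual hand-off is needed between steps.
    """
    return reduce(lambda current, stage: stage(current), stages, data)


# Hypothetical stand-ins for the real scrubbing and transforming scripts.
def drop_blank_lines(lines):
    return [line for line in lines if line.strip()]


def normalize_whitespace(lines):
    return [" ".join(line.split()) for line in lines]


def to_csv_rows(lines):
    return [",".join(line.split(" ")) for line in lines]


STAGES = [drop_blank_lines, normalize_whitespace, to_csv_rows]

result = run_pipeline(["a  b   c", "", " d e f "], STAGES)
print(result)  # ['a,b,c', 'd,e,f']
```

Keeping the stage order in a single list means that adding or reordering a step is a one-line change, which suits a process that must be repeated as the data grows.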
A DevOps Solution to a Problem Not Typically Solved with DevOps
We quickly realized that we were missing many pieces of DevOps that could have made this solution much better and more in line with DevOps methodologies. For instance, we were not using a continuous integration server. We added one, which allowed the team to continuously update and improve the automation scripts without difficult code merges. We also realized that we needed the server to run a suite of automated tests. Running that suite gives our team confidence that our scripts are written well and do not include vulnerabilities. Of course, there are other things we could do to improve our solution. Because we are leveraging DevOps for this project, we can iteratively and continuously improve and alter it.
We are in the process of incorporating more DevOps principles into our data collection. Some of the major updates will include deploying this infrastructure to a cloud environment and building a mature documentation process that will improve the traceability of the data. We learned that it is important to reach for a DevOps solution even if you fall short in the beginning. DevOps gives you the flexibility needed to iterate and continuously improve a solution throughout the project lifecycle.
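As an illustration of the kind of automated test a continuous integration server could run on every change, here is a small Python sketch. The `scrub_line` function and the Social Security number pattern are hypothetical stand-ins for the real scrubbing logic, and the tests are written as plain `test_*` functions on the assumption that a runner such as pytest collects them.

```python
import re

# Hypothetical pattern: U.S. Social Security numbers stand in for the
# sensitive information that must be removed (an assumption for illustration).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub_line(line):
    """Replace anything that looks like an SSN with a fixed placeholder."""
    return SSN.sub("[REDACTED]", line)


def test_removes_sensitive_value():
    assert scrub_line("id=123-45-6789 ok") == "id=[REDACTED] ok"


def test_leaves_clean_lines_unchanged():
    assert scrub_line("temp=21.5") == "temp=21.5"


def test_is_idempotent():
    # Scrubbing already-scrubbed output must not change it again.
    once = scrub_line("123-45-6789")
    assert scrub_line(once) == once
```

Tests like these do not prove that every sensitive value is caught, but they do catch regressions when someone edits the scrubbing rules.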