Improving Data Analysis with DevOps
Data analysis is complex and, at times, overwhelming. Automation increases an analysis team's ability to continuously improve its process. Specifically, automating the software involved is an effective way to manage the iteration and repetition that proper data analysis requires. DevOps is a natural fit for a project that requires software, automation, and collaboration. In particular, DevOps allows teams to automate the software-based aspects of the data analysis process and to collaborate effectively with all project stakeholders. In this blog post, I explore the ways in which DevOps improves data analysis.
Typically, an analysis team starts the data analysis process by defining their goals and identifying what they are trying to learn from a particular set of data. Then the team retrieves data from some source. The next step is to transform the data to ensure that it is analyzable. The complexity of this step is dependent on the data analysis technique that the project team decides to use to analyze their data set.
After the data is transformed appropriately, the analysis team begins data analysis. Data analysis includes determining which technique the situation requires, understanding the data set, performing the chosen techniques, and processing their results. After data analysis is complete, the team draws conclusions from the results and then restarts the process, making changes in each stage to improve its analyses. It is important to note that each of these stages involves a great deal of iteration and repetition, which leads to better analysis.
DevOps and Data Analysis
DevOps and automation greatly improve the data retrieval stage in the data analysis process. Once the data retrieval process is automated, the analysis team will have a continuously growing data set, which leads to stronger conclusions. The automation technique for this stage depends on the data retrieval technique and can range from writing custom code to automating the functionality of a data retrieval tool. The benefits that DevOps provides may not be as clear in this particular phase of data analysis, but they are real. For instance, storing these automation scripts in a central repository makes the data retrieval process transparent to all project stakeholders and prevents headaches down the road, such as retrieving wrong or incomplete data.
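To make the idea of a continuously growing data set concrete, here is a minimal sketch of a retrieval script that could live in a central repository. It is not taken from any particular project: the CSV store, the two-column schema in FIELDS, and the function names are all assumptions made for illustration. Each run appends only the records that have not been stored yet, so a scheduled job can call it repeatedly without duplicating data.

```python
import csv
from pathlib import Path

FIELDS = ["id", "value"]  # hypothetical schema, chosen only for illustration


def load_seen_ids(store: Path) -> set:
    """Return the ids already present in the stored data set."""
    if not store.exists():
        return set()
    with store.open(newline="") as f:
        return {row["id"] for row in csv.DictReader(f)}


def append_new_records(store: Path, records: list) -> int:
    """Append only records that are not stored yet; return how many were added."""
    seen = load_seen_ids(store)
    fresh = []
    for record in records:
        record_id = str(record["id"])  # ids are compared as strings, as CSV stores them
        if record_id not in seen:
            seen.add(record_id)
            fresh.append({"id": record_id, "value": record["value"]})

    write_header = not store.exists()
    with store.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(fresh)
    return len(fresh)
```

In practice, the records argument would come from whatever source the team retrieves data from; that part is deliberately left out because it depends on the retrieval technique.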
Once data is automatically retrieved and the data set continues to grow, the analysis team needs to transform the data into an appropriate format. Automation improves the data transformation process by providing the analysis team with the ability to handle all the data that is being automatically retrieved. DevOps also allows the project team to maintain and update transformation scripts effectively and collaboratively with the use of a continuous integration (CI) server and a central code repository. Automated testing of these scripts is also performed by the CI server and ensures that the transformation scripts have the intended functionality.
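As a rough illustration of what an automatically tested transformation script might look like, consider the sketch below. The record layout (an id, a US-style date, and a numeric value) and the function names are assumptions of mine, not details of a real project. The test functions are plain Python and follow the naming convention that a runner such as pytest discovers, so a CI server could run them on every commit.

```python
from datetime import datetime


def transform(row: dict) -> dict:
    """Normalize one raw record into an analyzable form (hypothetical layout)."""
    return {
        "id": row["id"].strip(),
        # Convert MM/DD/YYYY into ISO 8601 so dates sort and compare correctly.
        "timestamp": datetime.strptime(row["date"].strip(), "%m/%d/%Y").date().isoformat(),
        # float() raises ValueError on non-numeric input, which surfaces bad data early.
        "value": float(row["value"]),
    }


def test_transform_normalizes_fields():
    raw = {"id": " a1 ", "date": "03/07/2016", "value": "4.50"}
    assert transform(raw) == {"id": "a1", "timestamp": "2016-03-07", "value": 4.5}


def test_transform_rejects_bad_value():
    try:
        transform({"id": "a2", "date": "03/07/2016", "value": "n/a"})
    except ValueError:
        return
    raise AssertionError("expected ValueError for a non-numeric value")
```

Keeping the tests next to the script means that a change to the transformation logic that breaks the intended behavior fails the build before it reaches the shared data set.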
After the data is in an analyzable format, the team starts their analysis. Analysis is a broad term and can mean lots of different things. All analysis teams have different analysis techniques and use a unique set of tools to reach their conclusions. Analysis techniques and tools also change with the type of data that is being analyzed. These analysis techniques range from writing Visual Basic for Applications (VBA) in Excel to leveraging sophisticated data analysis tools, such as Tetrad and BayesiaLab. Automation improves these techniques and allows for iteration and repetition, which enables the analysis team to run analyses on additional sets of data.
A proper DevOps solution contains many pieces that allow the analysis team to develop their analyses effectively and iteratively. For instance, a central code repository allows the team to manage any scripts that are necessary. The team is also able to automatically provision virtual environments that can be used to deploy and run analysis tools. A CI server manages the testing and deployment of scripts and analysis tools.
DevOps allows a project team to collaborate and automate for successful data analysis. Data analysis also requires iteration and repetition, for which DevOps is well suited. Before DevOps, a data analysis team had a list of instructions for each data analysis technique, and each member of the project had to follow those instructions by hand to perform data analysis. With DevOps, a project team can simply run an automated service that retrieves data, transforms it appropriately, performs data analysis techniques, and returns the results to the entire project team.
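The automated service described above can be sketched as a small pipeline in which each stage is an ordinary function. Everything in this example is hypothetical: the sample records, the stage names, and the choice of a per-group average as the "analysis" are stand-ins, since real projects would plug in their own retrieval source and analysis tools.

```python
from statistics import mean


def retrieve() -> list:
    """Stand-in for an automated retrieval step (hard-coded sample data)."""
    return [
        {"group": "a", "value": "1.0"},
        {"group": "a", "value": "3.0"},
        {"group": "b", "value": "10.0"},
    ]


def transform(rows: list) -> list:
    """Convert raw string values into numbers so they can be analyzed."""
    return [{"group": r["group"], "value": float(r["value"])} for r in rows]


def analyze(rows: list) -> dict:
    """Compute the average value per group (a placeholder analysis)."""
    groups = {}
    for r in rows:
        groups.setdefault(r["group"], []).append(r["value"])
    return {group: mean(values) for group, values in groups.items()}


def run_pipeline(retrieve_fn=retrieve, transform_fn=transform, analyze_fn=analyze) -> dict:
    """Run the stages in order; each stage can be swapped out independently."""
    return analyze_fn(transform_fn(retrieve_fn()))


if __name__ == "__main__":
    # Report the results; a real service might publish them to the whole team instead.
    for group, average in sorted(run_pipeline().items()):
        print(f"{group}: {average:.2f}")
```

Passing the stages in as parameters is a design choice that keeps each one testable on its own and lets the team replace, say, the analysis step without touching retrieval or transformation.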
View our latest SEI Podcast, How Risk Management Fits into Agile DevOps and Government.