By John Klein, Senior Member of the Technical Staff, Architecture Practices Initiative
A recent IDC forecast predicts that the big data technology and services market will realize "a 26.4 percent compound annual growth rate to $41.5 billion through 2018, or about six times the growth rate of the overall information technology market." In previous posts highlighting the SEI's research in big data, we explored some of the challenges related to this rapidly growing field, including the need to make technology selections early in the architecture design process. We introduced an approach to help the Department of Defense (DoD) and other enterprises develop and evolve systems to manage big data. The approach, known as Lightweight Evaluation and Architecture Prototyping for Big Data (LEAP4BD), helps organizations reduce risk and streamline the selection and acquisition of big data technologies. In this blog post, we describe how we used LEAP4BD to help the Interagency Project Office achieve its mission to integrate the IT systems of the Military Health System and the Veterans Health Administration.
Our Approach to Big Data Risk Reduction: LEAP4BD
Before we delve into the case study, I would like to first highlight our research in big data, as well as the approach that we developed. In response to the volume, variety, and velocity challenges of big data, a new generation of scalable data management technologies has emerged. Relational database management systems, which provide strong data-consistency guarantees based on vertical scaling of compute and storage hardware, are being replaced by NoSQL (variously interpreted as "No SQL", or "Not Only SQL") data stores running on horizontally-scaled commodity hardware. These NoSQL databases achieve high scalability and performance using simpler data models, clusters of low-cost hardware, and mechanisms for relaxed data consistency that enhance performance and availability.
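The relaxed-consistency mechanisms mentioned above are often implemented as tunable quorums: with N replicas of each data item, a write acknowledged by W nodes and a read that consults R nodes are strongly consistent whenever R + W > N. The sketch below illustrates this rule in plain Python; it is a conceptual aid, not code from any particular NoSQL product.

```python
# Conceptual sketch of quorum-based tunable consistency, the mechanism many
# NoSQL stores use to trade consistency against availability and latency.
# With N replicas, a write acknowledged by W nodes and a read that queries
# R nodes overlap in at least one node whenever R + W > N, so the read is
# guaranteed to see the latest write.

def is_strongly_consistent(n_replicas: int, write_acks: int, read_quorum: int) -> bool:
    """Return True if every read is guaranteed to observe the latest write."""
    return read_quorum + write_acks > n_replicas

# N=3 replicas: quorum reads and writes (2 each) overlap, so reads see the
# newest write; single-node reads and writes (1 each) may return stale data.
print(is_strongly_consistent(3, 2, 2))  # True  (2 + 2 > 3)
print(is_strongly_consistent(3, 1, 1))  # False (1 + 1 <= 3)
```

Lowering R or W below a quorum improves latency and availability at the cost of possibly stale reads, which is exactly the trade-off these databases expose to the application.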
The rapid rise of NoSQL has created a complex and rapidly evolving landscape, with few standards and products offering radically different data models and capabilities. Selecting a database technology that can best support mission needs for timely development, cost-effective delivery, and future growth is not trivial. Using these new technologies to design and construct a massively scalable big data system creates an immense software architecture challenge for software architects and DoD program managers.
As highlighted in an earlier blog post, LEAP4BD was developed using principles drawn from proven architecture and technology analysis and evaluation techniques.
The LEAP4BD method has four steps:
Two Providers, Two Electronic Health Records Systems
The Interagency Project Office (IPO) was created in 2008 to integrate the electronic health records (EHR) systems of the Military Health System and the Veterans Health Administration (VHA).
The IPO saw the promise of big data technology in its architecture to integrate these systems, but the agency did not have extensive experience with NoSQL databases. The IPO also wanted an approach that would allow it to evaluate NoSQL databases without distracting its development teams. To address this challenge, the IPO contacted the U.S. Army's Telemedicine and Advanced Technology Research Center (TATRC), which in turn contacted the SEI.
Both systems posed numerous challenges to choosing a NoSQL database appropriate for data at this scale. These challenges included the following:
We began by engaging with stakeholders to determine the key quality requirements through a Quality Attribute Workshop. We also conducted one-on-one stakeholder interviews. The workshops and interviews offered insight into the type of scalability needed for the systems, as well as the size of the data sets, which resulted in two driving use cases:
Through these interviews and workshops, we also determined that the core workload was read-intensive: 80 percent read-only requests and 20 percent write requests.
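A read-intensive mix like this can be reproduced in a synthetic benchmark by sampling each operation's type against the target read fraction. The sketch below is a minimal illustration of that idea (the function name and parameters are my own, not part of the actual test harness):

```python
import random

def generate_workload(n_ops: int, read_fraction: float = 0.8, seed: int = 42):
    """Generate a synthetic operation mix: ~80% reads, ~20% writes.

    Each operation is independently drawn, so the realized read fraction
    converges to read_fraction as n_ops grows.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible mix
    return ["read" if rng.random() < read_fraction else "write"
            for _ in range(n_ops)]

ops = generate_workload(10_000)
print(ops.count("read") / len(ops))  # roughly 0.8
```

Fixing the random seed makes runs repeatable, which matters when comparing latency results across database configurations.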
Selecting Candidate Technologies
As outlined in LEAP4BD, we next focused on selecting candidate technologies. To conduct prototyping we needed to first narrow the field, ideally to three or four NoSQL technologies that we could carry into the prototyping step. There are four main NoSQL data model categories: key value stores, document stores, wide column stores, and graph stores. Then, within each of those categories, there are dozens of potential products, each with different implementation and different strengths and weaknesses. We worked to identify one representative from each NoSQL category:
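To make the data-model differences concrete, the sketch below shows how a single (invented) patient observation might be represented in three of the four categories. The field names and key layouts are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical mappings of one patient observation onto three NoSQL data
# models. All identifiers and field names here are invented for illustration.

# Key-value store: the value is opaque to the database, addressed by one key.
kv_record = {
    "patient:1234:obs:20150601": '{"type": "blood_pressure", "value": "120/80"}'
}

# Document store: the value is a nested, queryable document.
doc_record = {
    "_id": "1234",
    "name": "Jane Doe",
    "observations": [
        {"date": "2015-06-01", "type": "blood_pressure", "value": "120/80"}
    ],
}

# Wide column store: rows keyed by patient, columns grouped into families,
# with composite column names encoding date and observation type.
wide_column_record = {
    "row_key": "1234",
    "demographics": {"name": "Jane Doe"},
    "observations": {"2015-06-01:blood_pressure": "120/80"},
}
```

The choice among these models shapes which queries are cheap: the key-value layout supports only lookups by exact key, the document layout supports queries over nested fields, and the wide column layout supports efficient range scans over a patient's observations.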
We wanted to include a graph store, but when we examined the scalability requirements, we could not find any graph stores that would scale to meet the project's needs.
Having picked the three representative NoSQL databases, we started to develop prototype implementations and execute a set of tests. We did this in collaboration with TATRC's Advanced Concepts Team. The TATRC team had the hands-on development expertise, so we split the work so that the SEI provided a virtual private cloud environment for the testing (Amazon Web Services) and also defined the tests and conducted the analysis, while the TATRC team developed and executed the tests.
Throughout this prototyping and testing, we were usually running 12 or so virtual machines, but there were points where we had as many as 35 virtual machines running at one time to conduct simultaneous testing on multiple configurations (e.g., working on more than one database or more than one configuration of a particular database.). The SEI virtual private cloud environment provided the elasticity we needed, and allowed our TATRC development partners direct access to the test systems.
Our tests used a synthetic data set provided by TATRC that reflected the types of data typically found in an electronic healthcare record system: patient records, observations, and different types of tests and test results. The initial dataset wasn't large enough to sufficiently stress the databases, so we cloned each patient in the dataset several times and altered the names so that each clone became a unique patient record.
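The cloning step can be sketched as follows; this is a simplified illustration of the idea, with invented record fields, not the actual data-generation code:

```python
import copy

def expand_dataset(patients, copies_per_patient):
    """Clone each patient record, renaming each clone so every record is unique."""
    expanded = []
    for patient in patients:
        expanded.append(patient)  # keep the original record
        for i in range(1, copies_per_patient + 1):
            clone = copy.deepcopy(patient)  # clone observations too
            # Alter identifying fields so the clone is a distinct patient.
            clone["name"] = f'{patient["name"]}-clone{i}'
            clone["patient_id"] = f'{patient["patient_id"]}-{i}'
            expanded.append(clone)
    return expanded

seed_data = [{"patient_id": "p1", "name": "Jane Doe"}]
big = expand_dataset(seed_data, copies_per_patient=3)
print(len(big))  # 4
```

Because the clones carry the same observation structure as the originals, the expanded dataset stresses storage volume and index size without distorting the shape of the data.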
In addition to setting up a cluster of database servers for each of the products we evaluated, we developed a test client that executed our test workloads and collected the relevant performance data for analysis. The workloads included the following:
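A minimal sketch of such a test client is shown below, assuming injected read and write operations and a fixed 80/20 mix; the function names and structure are my own simplification, not the project's actual client:

```python
import random
import time

def run_workload(execute_read, execute_write, n_ops=1000,
                 read_fraction=0.8, seed=7):
    """Drive a read-heavy workload against injected database operations,
    recording per-operation latencies (in seconds) for later analysis."""
    rng = random.Random(seed)
    latencies = {"read": [], "write": []}
    for _ in range(n_ops):
        op = "read" if rng.random() < read_fraction else "write"
        start = time.perf_counter()
        (execute_read if op == "read" else execute_write)()
        latencies[op].append(time.perf_counter() - start)
    return latencies

# Stand-in no-op operations; a real client would call the database driver here.
lat = run_workload(lambda: None, lambda: None, n_ops=100)
print(len(lat["read"]) + len(lat["write"]))  # 100
```

Collecting raw per-operation latencies, rather than only averages, lets the analysis report tail percentiles, which is where the evaluated databases differed most under load.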
Our test results are discussed in some detail in this paper, but I highlight some qualitative and quantitative findings below:
Wrapping Up and Looking Ahead
The application of LEAP4BD demonstrated that focusing on data storage and access services addressed the major risk areas for the application. Specifically, we learned that keeping the method lean rapidly produced insights that became the basis for downstream decisions.
Also, a highly transparent and systematic analysis and evaluation method significantly reduces the burden of justification for the necessary investments to build, deploy, and operate the application.
Finally, the hands-on experience of working with the technology during prototype development reduced development risks. This experience also helped the IPO identify risks that must be mitigated in design and implementation, along with detailed strategies and measures that allow for continual assessment.
Thus far, our research on big data systems has focused on runtime observability in NoSQL technologies. Our work applying our approach to military electronic health records highlighted other challenges in big data that organizations face. For example, even with extensive investment in upfront design work, agencies and organizations often do not have much control over the data and its origin. Also, as new data inputs are added to the system and users find new ways to use the system, workloads change. Services running on a shared cloud infrastructure may not provide the anticipated or promised quality of service. Consequently, the system must be monitored end-to-end for performance and availability. In the coming months, our team of researchers plans to examine approaches for automating the monitoring of systems (e.g., the insertion of monitors as well as the collection and aggregation of the monitoring data).
We welcome your feedback on this research in the comments section below.
View the paper Application Specific Evaluation of NoSQL Databases, which I co-authored with Ian Gorton, Neil Ernst, Kim Pham, and Chrisjan Matser.
View the presentation A Systematic Presentation for Big Data Technology Selection, which I presented at the SEI's 2016 Software Solutions Conference.