Big Data Technology Selection: A Case Study

A recent IDC forecast predicts that the big data technology and services market will realize "a 26.4 percent compound annual growth rate to $41.5 billion through 2018, or about six times the growth rate of the overall information technology market." In previous posts highlighting the SEI's research in big data, we explored some of the challenges related to the rapidly growing field, which include the need to make technology selections early in the architecture design process. We introduced an approach to help the Department of Defense (DoD) and other enterprises develop and evolve systems to manage big data. The approach, known as Lightweight Evaluation and Architecture Prototyping for Big Data (LEAP4BD) helps organizations reduce risk and streamline selection and acquisition of big data technologies. In this blog post, we describe how we used LEAP4BD to help the Interagency Project Office achieve their mission to integrate the IT systems of the Military Health System and the Veterans Health Administration.

Our Approach to Big Data Risk Reduction: LEAP4BD

Before we delve into the case study, I would like to first highlight our research in big data, as well as the approach that we developed. In response to the volume, variety, and velocity challenges of big data, a new generation of scalable data management technologies has emerged. Relational database management systems, which provide strong data-consistency guarantees based on vertical scaling of compute and storage hardware, are being replaced by NoSQL (variously interpreted as "No SQL", or "Not Only SQL") data stores running on horizontally-scaled commodity hardware. These NoSQL databases achieve high scalability and performance using simpler data models, clusters of low-cost hardware, and mechanisms for relaxed data consistency that enhance performance and availability.

The rapid rise of NoSQL has created a complex and rapidly evolving landscape, with few standards and products offering radically different data models and capabilities. Selecting a database technology that can best support mission needs for timely development, cost-effective delivery, and future growth is not trivial. Using these new technologies to design and construct a massively scalable big data system creates an immense software architecture challenge for software architects and DoD program managers.

As highlighted in an earlier blog post, LEAP4BD was developed using principles drawn from proven architecture and technology analysis and evaluation techniques.

The LEAP4BD method has four steps:

Assess the system context and landscape. This includes identifying the applications fundamental data holdings, their relationships as well as the most frequent queries and access patterns. This step also entails identifying required performance and quantifies expected data and transaction growth.
Identify the architecturally-significant requirements and decision criteria. This step focuses on scalability, performance, security, availability, and data consistency, as well as engaging with stakeholders to characterize application quality attribute requirements.
Evaluate candidate technologies against quality attribute decision criteria. This step involves identifying and evaluating candidate technologies against the application's data and quality attribute requirements as well as selecting a small number of candidates (typically two to four) for validation through prototyping and testing. A trusted knowledge base, such as the SEI's Quality At Scale Knowledge Base for Big Data, can make this screening more systematic.
Validate architecture decisions and technology selections through focused prototyping. This step involved performing focused prototyping including go/no-go criteria as well as evaluating the prototype's behavior against a set of carefully designed, application-specific criteria (e.g., performance, scalability, etc.)

Also, this step involves generating evidence that can support the downstream investment decisions required to build, operate, and evolve the system. Also included is the qualitative evidence (e.g., data model fit, deployment options, etc.) and quantitative evidence (e.g., sensitivities and inflection points that highlights trends rather than absolutes).

Two Providers, Two Electronic Health Records Systems

The Interagency Project Office (IPO) was created in 2008 integrate the electronic health records (EHR) systems of the Military Health System and the Veterans Health Administration (VHA).

The Military Health System serves more than 9 million people including all DoD active duty military personnel, their beneficiaries and dependents, and military retirees. This system has unique requirements because, in addition to providing healthcare, it must also address force readiness, including immunizations, other standard health care services, and battlefield injuries and follow-up.
The Veterans Health Administration, a separate but related system, serves more than 5 million military veterans. The VHA operates on a different IT infrastructure from the military health system, so accessing a patient's medical care records from the time they are in the service until afterwards has in the past been very hard.

The IPO saw the promise of big data technology in their architecture to integrate these systems, but the agency did not have extensive experience with NoSQL databases. IPO also wanted an approach that would allow it to evaluate NoSQL databases without distracting its development teams. To address this challenge, the IPO contacted TATRAC, the U.S. Army's Telemedicine & Advanced Technology Research Center (TATRC), which in turn contacted the SEI.

Both systems posed numerous challenges to choosing a NoSQL database appropriate for this level of data. These challenges included the following:

Convergence of concerns. Data store technology and system architecture are intertwined, and technology selection cannot be deterred.
Rapidly changing technology landscape. New products are emerging along with multiple releases per year on existing products. Organizations face a need to balance speed with precision.
Large potential solution space. Given the large number of potential solutions, organizations need to quick narrow down and focus on a solution.
Scale makes full-fidelity prototyping impractical. The scale of data sets and systems leads to issues that include test data generation and storage; server acquisition, provisioning, and management; and workload generation and test execution.
Technology is highly configurable. There are a multitude of configuration options at the network, operating system, database, and application levels. It is easy to spend time optimizing a test configuration instead of focusing on go/no-go criteria.

We began by engaging with stakeholders to determine the key quality requirements through a Quality Attribute Workshop. We also conducted one-on-one stakeholder interviews. The workshops and interviews offered insight into the type of scalability needed for the systems, as well as the size of the data sets, which resulted in two driving use cases:

The first use case that a medical provider should be able to retrieve the results of the five most recent medical tests for a patient. This seemed simple to us at first, but we soon realized it was not.
The second use case required all readers to see the same data when a new medical test result is written, both within a facility and across facilities. This requirement implied strong consistency on updates.

Through these interviews and workshop we also identified that the core workload was read-intensive, with 80 percent read-only requests, and 20 percent write requests.

Selecting Candidate Technologies

As outlined in LEAP4BD, we next focused on selecting candidate technologies. To conduct prototyping we needed to first narrow the field, ideally to three or four NoSQL technologies that we could carry into the prototyping step. There are four main NoSQL data model categories: key value stores, document stores, wide column stores, and graph stores. Then, within each of those categories, there are dozens of potential products, each with different implementation and different strengths and weaknesses. We worked to identify one representative from each NoSQL category:

Riak (key value)
Cassandra (wide column)
Mongo DB (document)

We wanted to include a graph store database, but when we started looking at scalability requirements, we couldn't find any graph stores that would scale to meet their needs.

Focused Prototyping

Having picked the three representative NoSQL databases, we started to develop prototype implementations and execute a set of tests. We did this in collaboration with TATRC's Advanced Concepts Team. The TATRC team had the hands-on development expertise, so we split the work so that the SEI provided a virtual private cloud environment for the testing (Amazon Web Services) and also defined the tests and conducted the analysis, while the TATRC team developed and executed the tests.

Throughout this prototyping and testing, we were usually running 12 or so virtual machines, but there were points where we had as many as 35 virtual machines running at one time to conduct simultaneous testing on multiple configurations (e.g., working on more than one database or more than one configuration of a particular database.). The SEI virtual private cloud environment provided the elasticity we needed, and allowed our TATRC development partners direct access to the test systems.

Our tests used a synthetic data set provided by TATRC that reflected the types of data typically found in an electronic healthcare record system: patient records, observations, different types of tests and test results. The initial dataset wasn't large enough to sufficiently stress the databases, so we cloned each patient in the dataset several times and altered the name so it would result in a unique patient record.

In addition to setting up a cluster of database servers for each of the products we evaluated, we developed a test client that executed out test workloads, and collected the relevant performance data for analysis. The workloads included the following:

Initial data set upload - initializing the database?
Operational workloads - reading, writing, and retrieving test results for patients

Our test results our discussed in some detail in this paper, but I highlight some qualitative and quantitative findings below:

Our most significant qualitative result was a clear understanding of the limitations of the data query capabilities of the various NoSQL databases that we tested. These databases do not support typical SQL query concepts such as "order by" and "limit", and so retrieving the five most recent test results for a particular patient required careful design, matching each database's indexing capabilities with its query language capabilities to achieve the desired functionality.

A useful quantitative result was to measure the performance cost of strict replica consistency, which was a requirement in one of our use cases. Most NoSQL databases achieve their high performance partly through allowing system designers to selectively relax consistency in cases where it is not needed. In our testing, for the configurations and workloads we considered, we found that strong consistency reduced Cassandra throughput from 4,000 operations per second to 3,500 (approximately 12 percent reduction)

Wrapping Up and Looking Ahead

The application of LEAP4BD demonstrated that the focus on data storage and access services addressed the major risk areas for an application. Specifically, we learned that keeping the method lean rapidly produced insights that became the basis for downstream decisions.

Also, a highly transparent and systematic analysis and evaluation method significantly reduces the burden of justification for the necessary investments to build, deploy, and operate the application.

Finally, the hands-on experience of working with the technology during prototype development, reduced development risks. This type of experience also helped the IPO identify risks that must be mitigated in design and implementation along with detailed strategies and measures that allow for continual assessment.

Thus far, our research on big data systems has focused on runtime observability in NoSQL technologies. Our work applying our approach to military electronic health records highlighted other challenges in big data that organizations faces. For example, while there has been extensive investment in upfront design work, agencies and organizations often don't have great control over the data and its origin. Also, as new data inputs are added to the system and users find new ways to use the system, workloads change. Services running on a shared cloud infrastructure may not provide the anticipated or promised quality of service. Consequently, the system must be monitored end-to-end for performance and availability. In the coming months, our team of researchers plans to examine approaches for automating the monitoring of systems (e.g., insertion of monitors as well as the collection and aggregation of the data.)

Additional Resources

View the paper Application Specific Evaluation of NoSQL Databases that I co-authored with Ian Gorton, Neil Ernst, Kim Pham and Chrisjan Matser.

View the presentation A Systematic Presentation for Big Data Technology Selection, which I presented at the SEI's 2016 Software Solutions Conference.

Software Engineering Institute

SEI Blog