
Addressing the Software Engineering Challenges of Big Data

Ian Gorton

New data sources, ranging from diverse business transactions to social media, high-resolution sensors, and the Internet of Things, are creating a digital tidal wave of big data that must be captured, processed, integrated, analyzed, and archived. Big data systems storing and analyzing petabytes of data are becoming increasingly common in many application areas. These systems represent major, long-term investments requiring considerable financial commitments and massive scale software and system deployments.

With analysts estimating data storage growth at 30 to 60 percent per year, organizations must develop a long-term strategy to address the challenge of managing projects that analyze exponentially growing data sets with predictable, linear costs. This blog post describes a lightweight risk reduction approach called Lightweight Evaluation and Architecture Prototyping (for Big Data) we developed with fellow researchers at the SEI. The approach is based on principles drawn from proven architecture and technology analysis and evaluation techniques to help the Department of Defense (DoD) and other enterprises develop and evolve systems to manage big data.

The Challenges of Big Data

For the DoD, the challenges of big data are daunting. Military operations, intelligence analysis, logistics, and health care all represent big data applications, with data growing at exponential rates and a need for scalable software solutions to sustain future operations. In 2012, the DoD announced a $250 million annual research and development investment in big data targeted at specific mission needs such as autonomous systems and decision support. For example, under the Data-to-Decisions S&T Priority Initiative, the DoD has developed a roadmap through 2018 that identifies the need for distributed, multi-petabyte data stores that can underpin mission needs for scalable knowledge discovery, analytics, and distributed computations.

The following examples illustrate two different DoD data-intensive missions:

  1. Electronic health records. The Military Health System provides care for more than 9.7 million active-duty personnel, their dependents, and retirees. The 15-year-old repository operates a continuously growing petascale database with more than 100 application interfaces. System workload types include care delivery, force readiness, and research/analytics. The system does not currently meet the target of 24/7 availability.
  2. Flight data management. Modern military avionics systems capture tens of gigabytes (GBs) of data per hour of operation. This data is collected in in-flight data analysis systems, which perform data filtering and organization, correlation with other data sources, and identification of significant events. These capabilities support user-driven analytics for root cause detection and defect prediction to reduce engineering and maintenance costs.

To address these big data challenges, a new generation of scalable data management technologies has emerged in the last five years. Relational database management systems, which provide strong data-consistency guarantees based on vertical scaling of compute and storage hardware, are being replaced by NoSQL (variously interpreted as "No SQL", or "Not Only SQL") data stores running on horizontally-scaled commodity hardware. These NoSQL databases achieve high scalability and performance using simpler data models, clusters of low-cost hardware, and mechanisms for relaxed data consistency that enhance performance and availability.
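To make the data-model contrast concrete, the sketch below (in Python, with a purely hypothetical health-records schema) shows the kind of denormalized, key-addressable record a document-oriented NoSQL store favors, compared with the normalized rows a relational system would join at query time.

```python
# Minimal sketch of two data models; the schema and field names are hypothetical.

# Relational style: normalized rows, joined at query time.
patients = [{"patient_id": 1, "name": "Jane Doe"}]
encounters = [{"encounter_id": 10, "patient_id": 1, "facility": "Clinic A"}]

# Document style: a single denormalized record, retrievable by key in one read.
patient_doc = {
    "_id": 1,
    "name": "Jane Doe",
    "encounters": [
        {"encounter_id": 10, "facility": "Clinic A"},
    ],
}

# The document model trades storage duplication and update complexity for
# fast, join-free reads that partition cleanly across commodity nodes.
```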

This complex and rapidly evolving landscape of non-standardized technologies, built on radically different data models, creates a challenging technology adoption problem, however. Selecting a database technology that can best support mission needs for timely development, cost-effective delivery, and future growth is non-trivial. Using these new technologies to design and construct a massively scalable big data system poses an immense challenge for software architects and DoD program managers.

Why Scale Matters in Big Data Management

Scale has many implications for software architecture, and we describe two of them in this blog post. The first revolves around the fundamental changes that scale enforces on how we design software systems. The second is based upon economics, where small optimizations in resource usage at very large scales can lead to huge cost reductions in absolute terms. The following briefly explores these two issues:

Designing for scale. Big data systems are inherently distributed systems. Hence, software architects must explicitly deal with issues of partial failures, unpredictable communications latencies, concurrency, consistency, and replication in the system design. These issues are exacerbated as systems grow to utilize thousands of processing nodes and disks, geographically distributed across data centers. For example, the probability of failure of a hardware component increases with scale.
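A rough back-of-envelope calculation makes the point. The 8 percent annual per-server failure rate is the figure from the study cited in the next paragraph; the cluster size and the assumption of independent failures are simplifications for illustration.

```python
# Back-of-envelope sketch: at cluster scale, hardware failure becomes a
# routine, near-daily event. Cluster size and failure independence are
# simplifying assumptions.
annual_rate = 0.08      # annual per-server hardware failure rate (cited study)
servers = 10_000        # hypothetical cluster size

expected_failures_per_year = servers * annual_rate          # ~800 servers
daily_rate = annual_rate / 365                              # ~0.00022 per server per day
p_failure_today = 1 - (1 - daily_rate) ** servers           # ~0.89

print(f"Expected server failures per year: {expected_failures_per_year:.0f}")
print(f"Probability of at least one failure today: {p_failure_today:.2f}")
```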

Studies such as this 2010 research paper, "Characterizing Cloud Computing Hardware Reliability," find that 8 percent of all servers in a data center experience a hardware problem annually, with the most common cause being disk failure. In addition, applications must deal with unpredictable communication latencies and network partitions due to link failures. These requirements mandate that scalable applications treat failures as common events that must be handled gracefully to ensure that the application's operation is not interrupted. To address such requirements, resilient big data software architectures must:

  • Replicate data across clusters and data centers to ensure availability in the case of disk failure or network partitions. Replicas must be kept consistent using either master-slave or multi-master protocols. The latter requires mechanisms to handle inconsistencies due to concurrent writes, typically based on Lamport clocks.
  • Design components to be stateless and replicated and to tolerate failures of dependent services, for example, by using the Circuit Breaker pattern described by Michael T. Nygard in his book Release It! and returning cached or default results whenever failures are detected. This pattern ensures that failures do not rapidly propagate across components and gives applications an opportunity to recover (a minimal sketch of the pattern follows this list).
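Below is a minimal sketch of the Circuit Breaker idea referenced above. It is not Nygard's implementation; the class name, thresholds, and cached-fallback strategy are illustrative assumptions.

```python
import time

class CircuitBreaker:
    """Minimal Circuit Breaker sketch: after repeated failures the breaker
    opens and calls fail fast (returning a fallback) until a cool-down
    period elapses. Thresholds and names are illustrative."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, operation, fallback):
        # While open and within the cool-down window, skip the remote call.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None      # half-open: allow a trial call
            self.failures = 0
        try:
            result = operation()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

# Hypothetical usage: wrap a call to a dependent service and fall back to a
# cached value when the breaker is open or the call fails.
# breaker = CircuitBreaker()
# profile = breaker.call(lambda: fetch_profile(user_id), lambda: cached_profile)
```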

Economics at scale. Big data applications employ many thousands of compute-and-storage resources. Regardless of whether these resources are capital purchases or resources hosted by a commercial cloud provider, they remain a major cost and hence a target for reduction. Straightforward resource reduction approaches (such as data compression) are common ways to reduce storage costs. Elasticity is another way that big data applications optimize resource usage, by dynamically deploying new servers to handle increases in load and releasing them as load decreases. Elastic solutions require servers that boot quickly and application-specific strategies for avoiding premature resource release.
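As a simple illustration of such an elasticity strategy, the policy sketch below scales out quickly under high utilization and deliberately delays scale-in to avoid premature resource release. All thresholds are assumptions, not recommendations.

```python
def desired_server_count(current_servers, cpu_utilization, minutes_below_threshold):
    """Illustrative elasticity policy (all thresholds are assumptions):
    scale out promptly when utilization is high, but scale in only after
    load has stayed low for a sustained period, to avoid releasing
    resources prematurely and then thrashing."""
    if cpu_utilization > 0.75:
        return current_servers + max(1, current_servers // 10)   # add ~10%
    if cpu_utilization < 0.30 and minutes_below_threshold >= 15:
        return max(1, current_servers - 1)                       # drain one node
    return current_servers
```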

Other strategies seek to optimize the performance of common tools and components to maintain productivity while decreasing resource utilization. For example, Facebook built HipHop, a PHP-to-C++ transformation engine that reduced the CPU load for serving its web pages by 50 percent. At the scale of Facebook's deployment, this represents a very significant resource reduction and cost savings. Software license costs, which can become prohibitive at scale, are another target for reduction. This drive for cost reduction has led leading internet organizations to develop a proliferation of database and middleware technologies, many of which have been released as freely available open source. Netflix and LinkedIn provide examples of powerful, scalable technologies for big data systems.

Other implications of scale for big data software architectures revolve around testing and fault diagnosis. Due to the deployment footprint of applications and the massive data sets they manage, it's impossible to create comprehensive test environments to validate software changes before deployment. Approaches such as canary testing and simian armies represent the state of the art in testing at scale. When the inevitable problems occur in production, rapid diagnosis can only be achieved by advanced monitoring and logging. In a large-scale system, log analysis itself can quickly become a big data problem as log sizes can easily reach hundreds of GBs per day. Logging solutions must include a low overhead, scalable infrastructure such as Blitz4J, and the ability to rapidly reconfigure a system to redirect requests away from faulty services.
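The canary-testing idea mentioned above can be captured in a small sketch: route a small fraction of production traffic to the new version and promote it only if its error rate stays close to the baseline. The function names, traffic fraction, and tolerance below are hypothetical.

```python
import random

def route_request(request, baseline_handler, canary_handler, canary_fraction=0.01):
    """Send a small, random slice of production traffic to the canary build."""
    handler = canary_handler if random.random() < canary_fraction else baseline_handler
    return handler(request)

def promote_canary(baseline_error_rate, canary_error_rate, tolerance=0.001):
    """Promote only if the canary's error rate is no worse than the baseline
    by more than an (illustrative) tolerance."""
    return canary_error_rate <= baseline_error_rate + tolerance
```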

The necessarily large investments and magnified risks that accompany the construction of massive, scalable data management and analysis systems exacerbate these challenges of scale. For this reason, software engineering approaches that explicitly address the fundamental issues of scale and new technologies are a prerequisite for project success.

Designing for Scalability with Big Data

To mitigate the risks associated with scale and technology, a systematic, iterative approach is needed to ensure that initial design models and database selections can support the long-term scalability and analysis needs of a big data application. A modest investment in upfront design can pay for itself many times over by greatly reducing redesign, implementation, and operational costs over the long lifetime of a large-scale system.

Because the scale of the target system prevents the creation of full-fidelity prototypes, a well-structured software engineering approach is needed to frame the technical issues, identify the architecture decision criteria, and rapidly construct and execute relevant but focused prototypes. Without this structure, it is easy to fall into the trap of chasing a deep understanding of the underlying technology instead of answering the key go/no-go questions about a particular candidate technology. Reaching the right decisions at minimum cost should be the aim of the exercise.

At the SEI, we have developed a lightweight risk reduction approach that we have initially named Lightweight Evaluation and Architecture Prototyping (for Big Data), or LEAP(4BD). Our approach is based on principles drawn from proven architecture and technology analysis and evaluation techniques such as the Quality Attribute Workshop and the Architecture Tradeoff Analysis Method. LEAP(4BD) leverages our extensive experience with architecture-based design and assessment and customizes these methods with deep knowledge and experience of the architectural and database technology issues most pertinent to big data systems. Working with an organization's key business and technical stakeholders, this approach involves the following general guidelines:

  • Assess the existing and future data landscape. This step identifies the application's fundamental data holdings, their relationships, the most frequent queries and access patterns, and their required performance and quantifies expected data and transaction growth. The outcome sets the scope and context for the rest of the analysis and evaluation and provides initial insights into the suitability of a range of contemporary data models (e.g., key-value, graph, document-oriented, column-oriented, etc.) that can support the application's requirements.
  • Identify the architecturally-significant requirements and develop decision criteria. Focusing on scalability, performance, security, availability, and data consistency, stakeholders characterize the application's quality attribute requirements that drive the system architecture and big data technology selection. Combining these architecture requirements with the characteristics of the data model (previous step) provides the necessary information for initial architecture design and technology selection.
  • Evaluate candidate technologies against quality attribute decision criteria. Working with the system architects, this step identifies and evaluates candidate big data technologies against the application's data and quality attribute requirements and selects a small number of candidates (typically two to four) for validation through prototyping and testing. The evaluation is streamlined by an evaluation criteria framework for big data technologies that we are developing as part of our internal R&D. This framework focuses the assessment of specific database products against a generic collection of behavioral attributes and measures.
  • Validate architecture decisions and technology selections. Through focused prototyping, our approach ensures that the system design and selected database technologies can meet the defined quality attribute needs. By evaluating the prototype's behavior against a set of carefully designed, application-specific criteria (e.g., performance, scalability, etc.), this step provides concrete evidence that can support the downstream investment decisions required to build, operate, and evolve the system. During the construction and execution of the prototypes, the project team also gains hands-on experience with the specific big data technologies under consideration (an illustrative measurement sketch follows this list).
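The sketch below illustrates the kind of focused prototype measurement used in the validation step: drive a candidate data store with a representative operation and record latency and throughput for comparison against the decision criteria. The `operation` callable stands in for a read or write issued through whichever client library is under evaluation; it is a placeholder, not a specific product API.

```python
import statistics
import time

def benchmark(operation, requests=10_000):
    """Run a representative workload against a candidate store and report
    latency percentiles and throughput for comparison against the decision
    criteria. `operation` is a placeholder for a read or write issued
    through the candidate technology's client library."""
    latencies = []
    for _ in range(requests):
        start = time.perf_counter()
        operation()
        latencies.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    latencies.sort()
    return {
        "median_ms": statistics.median(latencies),
        "p99_ms": latencies[int(0.99 * len(latencies)) - 1],
        "throughput_ops_per_s": len(latencies) / (sum(latencies) / 1000.0),
    }

# Hypothetical usage against a candidate store's client:
# results = benchmark(lambda: store.get(random_key()))
```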

Benefits

LEAP(4BD) provides a rigorous methodology for organizations to design enduring big data management systems that can scale and evolve to meet long-term requirements. The key benefits are:

  1. A focus on the database and key analysis services addresses the major risk areas for an application. This keeps the method lean and produces rapid design insights that become the basis for downstream decisions.
  2. A highly transparent and systematic analysis and evaluation method significantly reduces the burden of justification for the necessary investments to build, deploy, and operate the application.
  3. Maximizing the potential for leveraging modern big data technologies reduces costs and ensures that an application can satisfy its quality attribute requirements.
  4. Greatly increased confidence in architecture design and database technology selection, and hands-on experience working with the technology during prototype development, reduces development risks.
  5. Outstanding project risks that must be mitigated in design and implementation are identified, along with detailed mitigation strategies and measures that allow for continual assessment.

Looking Ahead

We are currently piloting LEAP(4BD) with a federal agency. This project involves both scalable architecture design and focused NoSQL database technology benchmarking, as well as an assessment of features to meet the key quality attributes for scalable big data systems.

We are interested in working with organizational leaders who want to ensure appropriate technology selection and software architecture design for their big data systems. If you are interested in collaborating with us on this research, please leave us feedback in the comments section below or send an email to us at info@sei.cmu.edu.

Additional References

To read the article "Time, clocks, and the ordering of events in a distributed system" by Leslie Lamport, please visit
https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fpeople%2Flamport%2Fpubs%2Ftime-clocks.pdf

To read the paper "Characterizing cloud computing hardware reliability," by Kashi Venkatesh Vishwanath and Nachiappan Nagappan, which was presented at the Proceedings of the First Association for Computing Machinery Symposium on Cloud Computing, please visit
https://dl.acm.org/citation.cfm?doid=1807128.1807161

To read more about the challenges of ultra-large-scale systems please visit
https://resources.sei.cmu.edu/library/index.cfm?fp=sei_topic:Ultra-Large-Scale+Systems&global=true
