The Importance of Software Architecture in Big Data Systems
Many types of software systems, including big data applications, lend themselves to highly incremental and iterative development approaches. In essence, system requirements are addressed in small batches, enabling the delivery of functional releases of the system at the end of every increment, typically once a month. The advantages of this approach are many and varied. Perhaps foremost is the fact that it constantly forces the validation of requirements and designs before too much progress is made in inappropriate directions. Ambiguity and change in requirements, as well as uncertainty in design approaches, can be rapidly explored through working software systems, not simply models and documents. Necessary modifications can be carried out efficiently and cost-effectively through refactoring before code becomes too 'baked' and complex to easily change. This posting, the second in a series addressing the software engineering challenges of big data, explores how the nature of building highly scalable, long-lived big data applications influences iterative and incremental design approaches.
Iterative, incremental development approaches are embodied in agile development methods, such as XP and Scrum. While the details of each approach differ, the notion of evolutionary design is at the core of each. Agile software architects eschew large, planned design efforts (also known as Big Design Up Front) in favor of just enough design to meet deliverable goals for an iteration. Design modifications and improvements occur as each iteration progresses, providing ongoing course corrections to the architecture and ensuring that only features that support the current targeted functionality are developed.
Martin Fowler provides an excellent description of the pros and cons of this approach. He emphasizes the importance of test-driven development and continuous integration as key practices that make evolutionary design feasible. In a similar vein, at the SEI we are developing an architecture-focused approach that can lead to more informed system design decisions that balance short-term needs with long-term quality.
Evolutionary, emergent design encourages lean solutions and avoids over-engineered features and software architectures. This design approach limits time spent on tasks such as updating lengthy design documentation. The aim is to deliver, in as streamlined a manner as possible, a system that meets its requirements.
There is, of course, an underlying assumption that must hold for evolutionary design to be effective: change is cheap. Changes that are fast to make can easily be accommodated within short development cycles. Not all changes are cheap, however. Cyber-physical systems, where hardware-software interfaces are dominant, offer prominent examples of systems in which hardware modifications or unanticipated deployment environments can lead to changes with long development cycles. Other types of change can be expensive in purely software systems, as well.
For example, poorly documented, tightly coupled, legacy code can rarely be successfully replaced in a single iteration. Incorporating new, third-party components or subsystems can involve lengthy evaluation, prototyping, and development cycles, especially when negotiations with vendors are involved. Likewise, architectural changes--for example, moving from a master-slave to a peer-to-peer deployment architecture to improve scalability--regularly require a fundamental and widespread re-design and refactoring that must be spread judiciously over several development iterations.
Evolutionary Design and Big Data Applications
As we described in a previous blog post, our research focuses on addressing the challenges of building highly scalable big data systems. In these systems, the requirements for extreme scalability, performance, and availability introduce complexities that require new design approaches from the software engineering community. Big data solutions must adopt highly distributed architectures with data collections that are (1) partitioned over many nodes in clusters to enhance scalability and (2) replicated to increase availability in the face of hardware and network failures. NoSQL distributed database architectures provide many of the capabilities that make scalability feasible at acceptable costs. They also introduce inherent complexities that force applications to perform the following tasks:
- explicitly handle data consistency
- tolerate a wide range of hardware and software faults
- support component monitoring and performance measurement so that operators have visibility into the behavior of the deployment
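To make the partitioning and replication described above more concrete, here is a minimal sketch in Python. It is not how any particular NoSQL database works; the function names (`partition_for`, `replica_nodes`) and the simple "hash the key, then copy to the next nodes" placement rule are assumptions chosen purely for illustration.

```python
import hashlib
from typing import List


def partition_for(key: str, num_nodes: int) -> int:
    """Map a key to a primary node index using a stable hash (illustrative rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def replica_nodes(key: str, num_nodes: int, replication_factor: int) -> List[int]:
    """Return the primary node followed by the next nodes that hold copies."""
    if not 1 <= replication_factor <= num_nodes:
        raise ValueError("replication_factor must be between 1 and num_nodes")
    primary = partition_for(key, num_nodes)
    # Place replicas on consecutive nodes, wrapping around the end of the cluster.
    return [(primary + i) % num_nodes for i in range(replication_factor)]


if __name__ == "__main__":
    # Hypothetical key and cluster size, used only to show the call.
    print(replica_nodes("user:42", num_nodes=5, replication_factor=3))
```

Even in this toy form, changing `num_nodes` moves most keys to different nodes, which hints at why data distribution decisions are hard to revisit cheaply once a system holds a lot of data.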
Due to the size of their deployment footprint, big data applications are often deployed on virtualized cloud platforms. Cloud platforms are many and varied in their nature, but generally offer a set of services for application configuration, deployment, security, data management, monitoring, and billing for use of processor, disk, and network resources. A number of cloud platforms from service providers, such as Amazon Web Services and Heroku, as well as open-source systems, such as OpenStack and Eucalyptus, are available for deploying big data applications in the cloud.
In this context of big data applications deployed on cloud platforms, it's interesting to examine the notion of evolutionary system design in an iterative and incremental development project. Recall that evolutionary design is effective as long as change is cheap. Hence, are there elements of big data applications where change is unlikely to be a straightforward task, and that might, in turn, require major rework and perhaps even fundamental architecture changes for an application?
We posit that there are two main areas in big data applications where change is likely so expensive and complex that it warrants a judicious upfront architecture design effort. These two areas revolve around changes to data management and cloud deployment technologies:
- Data Management Technologies. For many years, relational database technologies dominated data management systems. With a standard data model and query language, competitive relational database technologies share many traits, which makes moving to another platform or introducing another database into an application relatively straightforward. In the last five years, NoSQL databases have emerged as foundational building blocks for big data applications. This diverse collection of NoSQL technologies eschews standardized data models and query languages. Each technology employs radically different distributed data management mechanisms to build highly scalable, available systems. With different data models, proprietary application programming interfaces (APIs) and totally different runtime characteristics, any transition from one NoSQL database to another will likely have fundamental and widespread impacts on any code base.
- Cloud Deployments. Cloud platforms come in many shapes and sizes. Public cloud services provide hosting infrastructures for virtualized applications and offer sophisticated software and hardware platforms that support pay-as-you-use cost models. Private cloud platforms enable organizations to create clouds behind their corporate firewalls. Private clouds likewise offer sophisticated mechanisms for hosting virtualized applications, in this case on clusters managed by the development organization. As with NoSQL databases, little commonality exists between various public and private cloud offerings, making a migration across platforms a daunting proposition with pervasive implications for application architectures. In fact, a whole genre of dedicated cloud migration technologies, including Yuruware and Racemi, is emerging to address this problem. Where opportunities for new tools such as these exist, the problem they are addressing is likely not something that can be readily accommodated in an evolutionary design approach.
Our Lightweight Evaluation and Architecture Prototyping for Big Data (LEAP(4BD)) method reduces the risk of needing to migrate to a new database management system by ensuring that a thorough evaluation of the solution space is carried out in minimal time and with minimal effort. LEAP(4BD) provides a systematic approach for a project to select a NoSQL database that can satisfy its requirements. This approach is amenable to iterative and incremental design approaches, because it can be phased across one or more increments to suit the project's development tempo.
A key feature of LEAP(4BD) is its NoSQL database feature evaluation criteria. This ready-made set of criteria significantly speeds up a NoSQL database evaluation and acquisition effort. To this end, we have categorized the major characteristics of data management technologies based upon the following areas:
- Query Language--characterizes the API and specific data manipulation features supported by a NoSQL database
- Data Model--categorizes core data organization principles provided by a NoSQL database
- Data Distribution--analyzes the software architecture and mechanisms that are used by a NoSQL database to distribute data
- Data Replication--determines how a NoSQL database facilitates reliable, high performance data replication
- Consistency--categorizes the consistency model(s) that a NoSQL database offers
- Scalability--captures the core architecture and mechanisms that support scaling a big data application in terms of both data and request load increases
- Performance--assesses mechanisms used to provide high-performance data access
- Availability--determines mechanisms that a NoSQL database uses to provide high availability in the face of hardware and software failures
- Modifiability--questions whether an application data model can be easily evolved and how that evolution impacts clients
- Administration and Management--categorizes and describes the tools provided by a NoSQL database to support system administration, monitoring, and management
Within each of these categories, we have detailed evaluation criteria that can be used to differentiate big data technologies. For example, here's an extract from the Data Model evaluation criteria:
- Data Model style
- Data item identification
a. Key-value for each field
b. Objects in same store can have variable formats/types stored
c. Opaque data items that need application interpretation
d. Fixed or variable schema
e. Embedded hierarchical data items supported (e.g., sub-documents)
- Data Item key
a. Automatically allocated
b. Composite keys supported
c. Secondary indexes supported
d. Querying on non-key metadata supported
- Query Styles
a. Query by key
b. Query by partial key
c. Query by non-key values
d. Text search in data items
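The four query styles in this extract can be illustrated with a small in-memory sketch in Python. This is only a toy stand-in for a data store: the composite key of `(customer_id, order_id)`, the sample items, and the function names are all hypothetical, and a real NoSQL database may support only some of these styles or require secondary indexes to support them efficiently.

```python
from typing import Any, Dict, List, Optional, Tuple

Key = Tuple[str, str]  # hypothetical composite key: (customer_id, order_id)

# Toy data set standing in for a NoSQL collection.
store: Dict[Key, Dict[str, Any]] = {
    ("c1", "o1"): {"status": "shipped", "note": "leave at front door"},
    ("c1", "o2"): {"status": "pending", "note": "gift wrap requested"},
    ("c2", "o1"): {"status": "shipped", "note": "call on arrival"},
}


def query_by_key(key: Key) -> Optional[Dict[str, Any]]:
    """a. Query by key: a direct lookup on the full composite key."""
    return store.get(key)


def query_by_partial_key(customer_id: str) -> List[Key]:
    """b. Query by partial key: match only the first key component."""
    return sorted(k for k in store if k[0] == customer_id)


def query_by_non_key_value(field: str, value: Any) -> List[Key]:
    """c. Query by non-key values: here a full scan, with no index."""
    return sorted(k for k, item in store.items() if item.get(field) == value)


def text_search(term: str) -> List[Key]:
    """d. Text search: case-insensitive substring match over all field values."""
    needle = term.lower()
    return sorted(
        k for k, item in store.items()
        if any(needle in str(v).lower() for v in item.values())
    )
```

In this sketch only the first style is a cheap lookup; the other three scan every item, which is the kind of difference the evaluation criteria are meant to surface for each candidate technology.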
In LEAP(4BD), we first work with the project team to identify features pertinent to the system under development. These features help identify a specific set of technologies that will best support the system. From there, we weight individual features according to system requirements and evaluate each candidate technology against these features.
LEAP(4BD) is supported by a knowledge base that stores the results of our evaluations and comparisons of different NoSQL databases. We have pre-populated the LEAP(4BD) knowledge base with evaluations of specific technologies (e.g., MongoDB, Cassandra, and Riak) with which we have extensive experience. Each evaluation of a new technology adds to this knowledge base, making evaluations more streamlined as the knowledge base grows. Overall, the method is systematic, quantitative, and highly transparent, and it quickly produces a ranking of the candidate technologies according to project requirements.
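The weighting and ranking step can be sketched as a simple weighted average. The Python below is an illustrative assumption rather than the actual LEAP(4BD) scoring scheme, which is not specified here: the function name `rank_candidates`, the 0 to 5 score scale, the feature names, and the candidates `candidate_a` and `candidate_b` are all hypothetical.

```python
from typing import Dict, List, Tuple


def rank_candidates(
    weights: Dict[str, float],
    scores: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Rank candidates by the weighted average of their per-feature scores.

    weights: feature name -> importance for this project (assumed scale).
    scores:  candidate name -> {feature name -> score}; a missing feature counts as 0.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    ranking = []
    for name, feature_scores in scores.items():
        weighted = sum(w * feature_scores.get(f, 0.0) for f, w in weights.items())
        ranking.append((name, weighted / total_weight))
    # Highest weighted average first.
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical project priorities and made-up scores, for illustration only.
    weights = {"consistency": 3, "scalability": 5, "query_language": 2}
    scores = {
        "candidate_a": {"consistency": 4, "scalability": 2, "query_language": 5},
        "candidate_b": {"consistency": 2, "scalability": 5, "query_language": 3},
    }
    for name, score in rank_candidates(weights, scores):
        print(f"{name}: {score:.2f}")
```

With these made-up numbers, `candidate_b` ranks first because scalability carries the largest weight. Keeping the weights and scores explicit like this is one way to make the basis for a ranking transparent and easy to revisit when requirements change.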
As we have demonstrated thus far in this series, there are many facets to LEAP(4BD). The next post in this series on Big Data will explain the prototyping phase. In the meantime, we're keen to hear from developers and architects who are evaluating big data technologies, so please feel free to share your thoughts in the comments section below.
To listen to the podcast, An Approach to Managing the Software Engineering Challenges of Big Data, please visit https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=294249