Many types of software systems, including big data applications, lend themselves to highly incremental and iterative development approaches. In essence, system requirements are addressed in small batches, enabling the delivery of functional releases of the system at the end of every increment, typically once a month. The advantages of this approach are many and varied. Perhaps foremost is the fact that it constantly forces the validation of requirements and designs before too much progress is made in inappropriate directions. Ambiguity and change in requirements, as well as uncertainty in design approaches, can be rapidly explored through working software systems, not simply models and documents. Necessary modifications can be carried out efficiently and cost-effectively through refactoring before code becomes too 'baked' and complex to easily change. This posting, the second in a series addressing the software engineering challenges of big data, explores how the nature of building highly scalable, long-lived big data applications influences iterative and incremental design approaches.
Iterative, incremental development approaches are embodied in agile development methods, such as XP and Scrum. While the details of each approach differ, the notion of evolutionary design is at the core of each. Agile software architects eschew large, planned design efforts (also known as Big Design Up Front) in favor of just enough design to meet deliverable goals for an iteration. Design modifications and improvements occur as each iteration progresses, providing ongoing course corrections to the architecture and ensuring that only features that support the current targeted functionality are developed.
Martin Fowler provides an excellent description of the pros and cons of this approach. He emphasizes the importance of test-driven development and continuous integration as key practices that make evolutionary design feasible. In a similar vein, at the SEI we are developing an architecture-focused approach that can lead to more informed system design decisions that balance short-term needs with long-term quality.
Evolutionary, emergent design encourages lean solutions and avoids over-engineered features and software architectures. This design approach limits time spent on tasks such as updating lengthy design documentation. The aim is to deliver, in as streamlined a manner as possible, a system that meets its requirements.
There is, of course, an underlying assumption that must hold for evolutionary design to be effective: change is cheap. Changes that are fast to make can easily be accommodated within short development cycles. Not all changes are cheap, however. Cyber-physical systems, where hardware-software interfaces are dominant, offer prominent examples of systems in which hardware modifications or unanticipated deployment environments can lead to changes with long development cycles. Other types of change can be expensive in purely software systems, as well.
For example, poorly documented, tightly coupled, legacy code can rarely be successfully replaced in a single iteration. Incorporating new, third-party components or subsystems can involve lengthy evaluation, prototyping, and development cycles, especially when negotiations with vendors are involved. Likewise, architectural changes--for example, moving from a master-slave to a peer-to-peer deployment architecture to improve scalability--regularly require a fundamental and widespread re-design and refactoring that must be spread judiciously over several development iterations.
Evolutionary Design and Big Data Applications
As we described in a previous blog post, our research focuses on addressing the challenges of building highly scalable big data systems. In these systems, the requirements for extreme scalability, performance, and availability introduce complexities that require new design approaches from the software engineering community. Big data solutions must adopt highly distributed architectures with data collections that are (1) partitioned over many nodes in clusters to enhance scalability and (2) replicated to increase availability in the face of hardware and network failures. NoSQL distributed database architectures provide many of the capabilities that make scalability feasible at acceptable costs. They also introduce inherent complexities that force applications to perform the following tasks:
Due to the size of their deployment footprint, big data applications are often deployed on virtualized cloud platforms. Cloud platforms are many and varied in their nature, but generally offer a set of services for application configuration, deployment, security, data management, monitoring, and billing for use of processor, disk, and network resources. A number of cloud platforms from service providers, such as Amazon Web Services and Heroku, as well as open-source systems, such as OpenStack and Eucalyptus, are available for deploying big data applications in the cloud.
In this context of big data applications deployed on cloud platforms, it's interesting to examine the notion of evolutionary system design in an iterative and incremental development project. Recall that evolutionary design is effective as long as change is cheap. Hence, are there elements of big data applications where change is unlikely to be a straightforward task, and that might, in turn, require major rework and perhaps even fundamental architecture changes for an application?
We posit that there are two main areas in big data applications where change is likely so expensive and complex that it warrants a judicious upfront architecture design effort. These two areas revolve around changes to data management and cloud deployment technologies:
Our Lightweight Evaluation and Architecture Prototyping for Big Data (LEAP(4BD)) method reduces the risk of needing to migrate to a new database management system by ensuring that a thorough evaluation of the solution space is carried out in minimal time and with minimal effort. LEAP(4BD) provides a systematic approach for a project to select a NoSQL database that can satisfy its requirements. This approach is amenable to iterative and incremental design approaches, because it can be phased across one or more increments to suit the project's development tempo.
A key feature of LEAP(4BD) is its NoSQL database feature evaluation criteria. This ready-made set of criteria significantly speeds up a NoSQL database evaluation and acquisition effort. To this end, we have categorized the major characteristics of data management technologies based upon the following areas:
Within each of these categories, we have detailed evaluation criteria that can be used to differentiate big data technologies. For example, here's an extract from the Data Model evaluation criteria:
d. Text search in data items
In LEAP(4BD), we first work with the project team to identify features pertinent to the system under development. These features help identify a specific set of technologies that will best support the system. From there, we weight individual features according to system requirements and evaluate each candidate technology against these features.
LEAP(4BD) is supported by a knowledge base that stores the results of our evaluations and comparisons of different NoSQL databases. We have pre-populated the LEAP(4BD) knowledge base with evaluations of specific technologies (e.g., MongoDB, Cassandra, and Riak) with which we have extensive experience. Each evaluation of a new technology adds to this knowledge base, making evaluations more streamlined as the knowledge base grows. Overall, the method is systematic, quantitative, and highly transparent, and it quickly produces a ranking of the candidate technologies according to project requirements.
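To make the weighting and ranking step more concrete, here is a minimal Python sketch of a weighted feature score. It is not the LEAP(4BD) tooling or its actual scoring formula; the weighted-average calculation, the 0-5 rating scale, the feature names, the weights, and the candidate names (`candidate_a`, `candidate_b`) are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    weight: float  # relative importance to the project (hypothetical scale)


def weighted_score(features, scores):
    """Return the weight-normalized score for one candidate.

    `scores` maps feature name -> rating (0-5 here); a feature the
    candidate was not rated on counts as 0.
    """
    total_weight = sum(f.weight for f in features)
    if total_weight <= 0:
        raise ValueError("at least one feature needs a positive weight")
    weighted_sum = sum(f.weight * scores.get(f.name, 0) for f in features)
    return weighted_sum / total_weight


def rank_candidates(features, candidates):
    """Return (candidate name, score) pairs sorted best first."""
    scored = [(name, weighted_score(features, s)) for name, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative inputs only: names, weights, and ratings are invented,
# not taken from any real evaluation.
FEATURES = [
    Feature("text_search", 3.0),
    Feature("horizontal_scaling", 5.0),
    Feature("replication", 2.0),
]
CANDIDATES = {
    "candidate_a": {"text_search": 4, "horizontal_scaling": 3, "replication": 5},
    "candidate_b": {"text_search": 2, "horizontal_scaling": 5, "replication": 4},
}

if __name__ == "__main__":
    for name, score in rank_candidates(FEATURES, CANDIDATES):
        print(f"{name}: {score:.2f}")
```

With these made-up inputs, the heavier weight on scaling puts `candidate_b` ahead even though `candidate_a` rates higher on two of the three features, which is the kind of trade-off an explicit weighting makes visible.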
As we have demonstrated thus far in this series, there are many facets to LEAP(4BD). The next post in this series on Big Data will explain the prototyping phase. In the meantime, we're keen to hear from developers and architects who are evaluating big data technologies, so please feel free to share your thoughts in the comments section below.
To listen to the podcast, An Approach to Managing the Software Engineering Challenges of Big Data, please visit https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=294249