Posted on by Big Datain
By John Klein
Senior Member of the Technical Staff
Architecture Practices Initiative
Have you ever been developing or acquiring a system and said to yourself, I can't be the first architect to design this type of system. How can I tap into the architecture knowledge that already exists in this domain? If so, you might be looking for a reference architecture. A reference architecture describes a family of similar systems and standardizes nomenclature, defines key solution elements and relationships among them, collects relevant solution patterns, and provides a framework to classify and compare. This blog posting, which is excerpted from the paper, A Reference Architecture for Big Data Systems in the National Security Domain, describes our work developing and applying a reference architecture for big data systems.
Last year, I worked with architects at the Data to Decisions Cooperative Research Centre to define a reference architecture for big data systems used in the national security domain. Acquirers, system builders, and other stakeholders of big data systems can use this reference architecture to
Scoping the Target Domain
We began by scoping the target domain. A reference architecture defines a family of related systems, and we know from our work in software product lines that scoping the target domain is a key to success. In particular, if your scope is too broad, the information in the reference architecture will be too general to be useful. If the scope is too narrow, however, the information will resemble the description of a single system and will not be easy for others to reuse. We scoped our reference architecture by defining a set of four use cases across a range of missions:
From these use cases, we identified categories of requirements that were relevant to big data systems. These categories included data types (e.g., unstructured text, geospatial, and audio), data transformations (e.g., clustering, correlation), queries (e.g., graph traversal, geospatial), visualizations (e.g., image and overlay, network), and deployment topologies (e.g., sensor-local processing, private cloud, and mobile clients).
Our stakeholders had extensive experience developing and operating large-scale IT systems but needed help with the unique challenges arising from the volume, variety, and velocity of data in big data systems. We kept asking ourselves, Is this type of requirement different in a big data system? If so, how is it different? What might a newcomer to the domain miss? This analysis allowed us to reduce the background noise in the reference-architecture description, making the communication more effective.
We organized the reference architecture as a collection of modules that decompose the solution into elements that realize functions or capabilities and that relate to a cohesive set of concerns. Concerns are addressed by solution patterns (such as using the well-known pipes-and-filters pattern to process an unbounded data stream) or by strategies (which are design approaches that are less prescriptive than solution patterns, e.g., minimizing data transformations during the collection process). Together, modules and concerns define a solution-domain lexicon, and the discussion of each concern relates problem-space terminology (origin of the concern) to the solution terminology (patterns and strategies). Graphically, the model looks like this:
We found four types of concerns:
As noted above, we intended for this reference architecture to supplement other sources of general architecture knowledge. For example, while usability is obviously a concern in any human-computer interface, we did not specifically identify it as a concern in the reference architecture. In a big data system, however, providing an indication of data confidence (e.g., from a statistical estimate, provenance metadata, or heuristic) in the user interface affects usability, and we identified this as a concern for the Visualization module in the reference architecture.
Structuring the Reference Architecture
We used the four types of concerns described above to decompose a big data system into 13 modules grouped into three module categories:
In addition to the module decomposition, the reference architecture contained two supplemental sections to help our stakeholders apply the information. The first was a mapping that related COTS and open-source packages to the modules in the reference architecture. This simple tabular mapping allows a stakeholder to quickly understand how these technologies fit into the architecture--which solution capabilities each provides and how its use would affect the architecture of a system. A separate volume of the reference architecture maintains the mapping as it is the most dynamic and least normative prescriptive content..
We also returned to the use cases used to scope the reference architecture. For each use case, we showed how to use the reference architecture to design the architecture of a concrete system to realize the specified capabilities. In addition to providing a tutorial for our stakeholders, these examples served as an evaluation of the reference architecture contents and presentation.
If you are responsible for developing, integrating, or modernizing a number of systems that all deliver similar capabilities within a domain, creating a reference architecture can provide a framework for comparing, combining, and reusing solution elements.
Wrapping Up and Looking Ahead
This post (and our paper) describe a reference architecture for big data systems in the national security application domain, including the principles used to organize the architecture decomposition. This reference architecture serves as a knowledge capture and transfer mechanism, containing both domain knowledge (such as use cases) and solution knowledge (such as mapping to concrete technologies). We have also shown how the reference architecture can be used to define architectures for big data systems in our domain.
In the future, we would like to focus on the following areas of work:
We welcome your feedback on this work in the comments section below.
Read the paper that this blog post was based on, A Reference Architecture for Big Data Systems in the National Security Domain, which I co-authored with Ross Buglak, David Blockow, Troy Wuttke, and Brenton Cooper.
View my presentation Runtime Assurance for Big Data Systems.
Read the paper that I coauthored with Ian Gorton Distribution, Data, Deployment: Software Architecture Convergence in Big Data Systems.