Posted in Architecture
For more than 10 years, scientists, researchers, and engineers used the TeraGrid supercomputer network funded by the National Science Foundation (NSF) to conduct advanced computational science. The SEI has joined a partnership of 17 organizations and helped develop the successor to the TeraGrid called the Extreme Science and Engineering Discovery Environment (XSEDE). This posting, which is the first in a multi-part series, describes our work on XSEDE that allows researchers open access--directly from their desktops--to the suite of advanced computational tools and digital resources and services provided via XSEDE. This series is not so much concerned with supercomputers and supercomputing middleware, but rather with the nature of software engineering practice at the scale of the socio-technical ecosystem.
Background: Bringing Disciplined Engineering to TeraGrid
From 2001 to 2011, the NSF's network of supercomputers, services, and middleware--known as the TeraGrid--played a key role in establishing and supporting the emerging discipline of computational science. Computational scientists used the TeraGrid to study the atomic-level structure of a protein needed for sound perception, investigate how gas giants like Jupiter and Saturn form, and gain new insights into the early stages of HIV-1 infection that could help prevent the virus from "hijacking" healthy cells.
By 2008, however, it was becoming clear to NSF's Office of Cyberinfrastructure (OCI) that TeraGrid's success had generated new demands from computational scientists that the existing infrastructure could not satisfy. TeraGrid had provided technologies and services to integrate the nation's most advanced supercomputing resources. Computational scientists increasingly felt, however, that access to supercomputing resources was not enough: to support top-tier research programs, the national cyberinfrastructure ecosystem also needed an easier way to integrate a much broader array of digital assets and services, along with much greater stability and other quality attributes.
In 2008, NSF/OCI issued a call for proposals for TeraGrid Phase III: eXtreme Digital (XD) Resources for Science and Engineering. As expected, the Phase III solicitation called for a continuation of and substantial improvements to core TeraGrid technologies, as well as to its outreach, education, consulting, and infrastructure-management services. The solicitation also included something surprising--a strong emphasis on software and system engineering in Phase III. NSF reinforced this new emphasis in a series of town hall meetings for potential submitters.
Among the takeaways from those meetings, one thing was certain: The winning proposal would need disciplined engineering processes. But which processes? It was this question that Pittsburgh Supercomputing Center director Michael Levine posed to the SEI. The SEI's foundational role in identifying "process" as a lever for improving practice--and in achieving industrial-scale adoption of software engineering process through the Capability Maturity Model Integration (CMMI) and its many derivatives--made it a natural choice for Levine and his team.
Levine, a key leader in TeraGrid, also understood that NSF's culture, and the TeraGrid community in particular, required something different from the conventional notion of "process." It was this understanding that ultimately led Levine to the SEI's Research, Technology, and System Solutions (RTSS) program. After a few rounds of discussion, RTSS joined a team that included the National Center for Supercomputing Applications (NCSA), Pittsburgh Supercomputing Center (PSC), Texas Advanced Computing Center (TACC), and the National Institute for Computational Sciences (NICS). The team, assembled by John Towns at NCSA, ultimately developed the winning proposal. The successor to TeraGrid would now be known as the eXtreme Science and Engineering Discovery Environment (XSEDE).
Our Challenge: Creating an Engineering Culture
NSF decided to award one-year planning grants to the two most competitive preliminary proposals: XSEDE and XROADS.
NSF awarded the planning grants with the stipulation that the results be shovel-ready implementation plans, ready on Day 1. That is, the winning proposal would need to go operational with no disruption of service to any current customer of TeraGrid. The XSEDE team recognized the following challenges of developing a credible management and technical plan:
The SEI team believed that the challenge was crosscutting in nature: Project governance, software and system architecture, and engineering practice must be mutually reinforcing. Our first challenge was to persuade the XSEDE team to abandon the belief (widely held, as it turned out, by both the XSEDE and XROADS teams) that "doing" software engineering was nothing more than following a process, and, by extension, that the SEI could define "The Process."
Overturning this mistaken belief would require persistence. We undertook two efforts to do so:
As expected, the XSEDE team was mostly indifferent to the content of the SEMP, except to note that it lacked narrative "pizazz." They were surprised, however, by the effectiveness of techniques--such as the Mission Thread Workshops led by the SEI's Mike Gagliardi--at shedding light on the organizational implications of engineering decisions.
Most importantly, the team began to appreciate that the steps of "The Process" do not define most of what we regard as software engineering. Instead, real software engineering practices reside within these steps--the parts that often aren't written down.
Our New Challenge: Doing it Again, In Operation
With the one-year planning grant complete, in 2011 the XSEDE and XROADS teams submitted their plans and briefed them to the NSF Review Panel. When asked what worried him most about the review process, John Towns (XSEDE PI) replied that he was concerned the panel would decide to fund part of each of the XSEDE and XROADS efforts. That funding model would violate the by-now integrated management, technical, engineering, and governance structures that XSEDE had developed in its plan.
As it turned out, Towns' fears were partly realized when NSF announced its intent to award XD to XSEDE with the proviso that it incorporate the most meritorious elements of the XROADS effort and team. We were now confronted with the need to revisit planning assumptions and decisions made over a full year of deliberation--in effect, to "re-plan" the effort in the span of a few short weeks.
In hindsight, it is clear that NSF made a wise decision in merging the XSEDE and XROADS proposals. The merger avoided the schism in the community that might have arisen from a "winner-take-all" outcome, and it incorporated several exciting architectural ideas introduced by XROADS, such as delivering cloud resources as an alternative to leadership-class "big iron" computers and adopting "software as a service" as a business model and product delivery channel.
On the other hand, much of what we had accomplished in terms of introducing the seeds of an engineering culture in XSEDE had been thoroughly disrupted. After a year of engaging the XSEDE team on the nuances of software engineering methods, these methods had begun to pay dividends. Of course this progress relied on the skills and perspectives of just one of the two teams with which we had engaged.
Now our engineering teams would be reconstituted with people from both teams. Not only had we potentially lost a "common sense" of software engineering, but we now had engineering teams with "competing views" of what to accomplish. We had to reconstitute an engineering culture while simultaneously realigning our technical objectives. Moreover, because the TeraGrid effort was coming to its contractual end, we needed to do this "live," which is like having to redesign and rebuild an aircraft while in flight.
To be fair, although we knew that the merger was going to be challenging, we also knew that we were substantially further along than we had been one year earlier. Both teams had spent their planning efforts wisely. Their overall approaches were more similar than different, and where they differed, they were mostly complementary. There was also something at work that is difficult to quantify objectively: The members of the XSEDE and XROADS teams were mostly well known to one another, and there was a universal desire to move beyond the competition of the past years, roll up our sleeves, and get to work on the real mission of advancing the nation's computational science.
Results after One Year
What did we manage to accomplish? Did we build a redesigned aircraft while in mid-flight without crashing and burning? After one year, we can safely claim a good measure of success, the results of which are summarized in the figure below.
Figure 1 shows a portion of the XSEDE program that is engaged in engineering work on a day-to-day basis, as described below:
It is fair to say that on Day 1 of the XSEDE/XROADS union, none of the practice areas reflected in the graphic and outlined above were in place, and indeed, few if any had any legacy in TeraGrid. This substantial progress is due not to the SEI but to XSEDE leadership and staff.
Our plans for Year 2 combine two main thrusts
Upcoming Topics in the XSEDE Thread
My next post will explore establishing an engineering culture within an emerging socio-technical system. Other topics that I plan to post about are:
In addition to the above posts, I'd like to enlist the help of other contributors to the XSEDE effort, as well as researchers of engineering practices in and for socio-technical systems. Please feel free to leave your thoughts in the comments section below.
Due Credit Before Closing
While most of the credit belongs to XSEDE leadership and staff, I will take a few moments here to give credit to SEI colleagues who have contributed to this result:
For more information about the XSEDE project, please visit https://www.xsede.org/.
For more information about the SEI's work in architecture-centric engineering (ACE), please visit
For more information about the SEI's work in system-of-systems engineering, please visit www.sei.cmu.edu/sos/.
For more information about the SEI's work in ultra-large-scale systems, please visit www.sei.cmu.edu/uls/.
Visit the SEI Digital Library for other publications by Kurt.