DevOps Case Study: Amazon AWS
Regular readers of this blog will recognize a recurring theme in this series: DevOps is fundamentally about reinforcing desired quality attributes through carefully constructed organizational process, communication, and workflow. When teaching software engineering to graduate students in Carnegie Mellon University's Heinz College, I often spend time discussing well-known tech companies and their techniques for managing software engineering and sustainment. These discussions provide valuable real-world examples of software engineering approaches and their associated outcomes, and they make excellent case studies for DevOps practitioners. This post discusses one of my favorite real-world DevOps case studies: Amazon.
Amazon is one of the most prolific tech companies today. Amazon transformed itself in 2006 from an online retailer to a tech giant and pioneer in the cloud space with the release of Amazon Web Services (AWS), a widely used on-demand Infrastructure as a Service (IaaS) offering. Amazon accepted a lot of risk with AWS. By developing one of the first massive public cloud services, they accepted that many of the challenges would be unknown, and many of the solutions unproven. To learn from Amazon's success we need to ask the right questions. What steps did Amazon take to minimize the risk of this venture? How did Amazon engineers define their process to ensure quality?
Luckily, some insight into these questions became available when Google engineer Steve Yegge (a former Amazon engineer) accidentally made public an internal memo outlining his impression of Google's failings (and Amazon's successes) at platform engineering. This memo (which Yegge has specifically allowed to remain online) describes a decision that illustrates CEO Jeff Bezos's understanding of the underlying tenets of what we now call DevOps, as well as his dedication to what I will claim are the primary quality attributes of the AWS platform: interoperability, availability, reliability, and security. According to Yegge, Jeff Bezos issued a mandate during the early development of the AWS platform that stated, in Yegge's words:
- All teams will henceforth expose their data and functionality through service interfaces.
- Teams must communicate with each other through these interfaces.
- There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
- It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols -- doesn't matter. Bezos doesn't care.
- All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
- Anyone who doesn't do this will be fired.
Aside from the harsh presentation, take note of what is being done here. Engineering processes are being changed; that is, engineers at Amazon now must develop web service APIs to share all data internally across the entire organization. This change is specifically designed to incentivize engineers to build for the desired level of quality. Teams will be required to build usable APIs, or they will receive complaints from other teams needing to access their data. Availability and reliability will be enforced in the same fashion. As more completely unrelated teams need to share data, APIs will be secured as a means of protecting data, reducing resource usage, auditing, and restricting access from untrusted internal clients. Keep in mind that this mandate applied to all teams, not just development teams. Marketing wants some data you have collected on user statistics from the web site? Then marketing has to find a developer and use your API. You can quickly see how this created a wide array of users, use cases, user types, and usage scenarios for every team exposing any data within Amazon.
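To make the idea concrete, here is a minimal sketch of a team-owned service whose data can only be reached through its interface, with access restricted to registered clients and every call audited. It is an illustration under stated assumptions, not a description of Amazon's actual systems: the `UserStatsService` class, its method names, and the token scheme are all hypothetical, and the service runs in-process here for brevity rather than over the network as the mandate requires.

```python
class AccessDenied(Exception):
    """Raised when a caller presents a token the service does not recognize."""


class UserStatsService:
    """Hypothetical service owned by one team.

    Other teams never touch the underlying store directly; they call
    get_visits() with a token, and every call is recorded for auditing.
    """

    def __init__(self):
        self._store = {}      # private data store: page -> visit count
        self._clients = {}    # access token -> client (team) name
        self.audit_log = []   # list of (client, operation) tuples

    def register_client(self, token, client):
        """Grant a client team access under the given token."""
        self._clients[token] = client

    def record_visit(self, page):
        """Internal write path used by the owning team."""
        self._store[page] = self._store.get(page, 0) + 1

    def get_visits(self, token, page):
        """Public read path: authenticate, audit, then answer."""
        client = self._clients.get(token)
        if client is None:
            self.audit_log.append(("unknown", "get_visits:denied"))
            raise AccessDenied("unrecognized client token")
        self.audit_log.append((client, "get_visits"))
        return self._store.get(page, 0)


if __name__ == "__main__":
    service = UserStatsService()
    service.register_client("tok-marketing", "marketing")
    service.record_visit("/home")
    print(service.get_visits("tok-marketing", "/home"))
    print(service.audit_log)
```

In this sketch, the marketing team from the example above would obtain a token and call `get_visits` rather than querying the store itself, which is what gives the owning team a single place to add protections as its set of callers grows.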
DevOps teaches us to create a process that enforces our desired quality attributes, such as requiring automated deployment of our software to succeed before the continuous integration build can be considered successful. In effect, this scenario from Amazon is an authoritarian version of DevOps thinking. By rigorously requiring all teams within Amazon to eat (and serve!) their own dogfood, Bezos's engineering operation ensured that, through constant and rigorous use, their APIs would become mature, robust, and hardened.
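The deployment example above can be sketched as a tiny pipeline runner in which the build is reported as successful only if every stage, including automated deployment, succeeds. This is a simplified illustration, not any particular CI tool's API: `run_pipeline` and the stage names are hypothetical, and each stage is reduced to a function returning `True` or `False`.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (build_succeeded, names_of_stages_that_ran).
    """
    ran = []
    for name, step in stages:
        ran.append(name)
        if not step():
            # Any failing stage, deployment included, fails the whole build.
            return False, ran
    return True, ran


if __name__ == "__main__":
    # Hypothetical stages: the deployment step is part of the build itself.
    stages = [
        ("compile", lambda: True),
        ("unit tests", lambda: True),
        ("deploy to staging", lambda: False),  # simulated deployment failure
    ]
    succeeded, ran = run_pipeline(stages)
    print("build succeeded:", succeeded)
    print("stages run:", ran)
```

Because deployment is just another stage, a team cannot get a green build without a working deployment, so the process itself enforces the quality attribute.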
These API improvements happened organically at Amazon, without the need to issue micromanaging commands such as "All APIs within Amazon must introduce rate limit X and scale to Y concurrent requests," because teams were incentivized to continually improve their APIs to make their own working lives easier. When AWS was released a few years later, many of these same APIs comprised the public interface of the AWS platform, which was remarkably comprehensive and stable at release. This level of quality at release directly served business goals by contributing to the early adoption rates and steady increase in popularity of AWS, a platform that provided users with a comprehensive suite of powerful capabilities and immediate comfort and confidence in a stable, mature service.
Every two weeks, the SEI will publish a new blog post offering guidelines and practical advice to organizations seeking to adopt DevOps in practice. We welcome your feedback on this series, as well as suggestions for future content. Please leave feedback in the comments section below.
To listen to the podcast, DevOps--Transform Development and Operations for Fast, Secure Deployments, featuring Gene Kim and Julia Allen, please visit https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=58525.