Lessons from the Amazon Xmas Eve Outage

On Christmas Eve, Amazon Web Services (AWS) had a significant outage that impacted the services of a number of customers, notably including Netflix. The timing was especially bad, with many Netflix customers gathered with family members for the day, expecting to watch movies. Unsurprisingly, the trigger for this unfortunate event was a maintenance activity gone awry. As I’ve noted previously on this blog, human error continues to be a leading cause of technology service outages. Back in April 2011, Amazon had an extended outage in their EC2 service, also caused by human error during the execution of change activities.

Let’s start with some kudos for Amazon. They continue to respond to service outages with transparency, providing reasonably detailed, technical postmortem reports. I suspect that customers such as Netflix received even more detailed information, allowing them to understand all of the technical and procedural issues that led to the outage. Additionally, Amazon continues to show a focus on continuous improvement, detailing the process changes they will implement.

Before dissecting the latest outage for lessons learned, I’ll issue a quick disclaimer. I don’t have access to detailed reports about the outage, nor am I an expert on Amazon Web Services’ technical architecture or operational processes. Therefore, some of my insights and recommendations, while based on this outage, will be theoretical in nature. That said, I am confident that my thoughts are based on sound principles, stemming from years of managing mission-critical systems in large enterprise settings.

According to Amazon’s postmortem, the proximate cause of the outage was the accidental deletion of configuration information for AWS’ elastic load balancing (ELB) services. These services are an essential component of most web-based architectures, allowing workload to be distributed across a large, scalable set of hardware resources. As Amazon reported in their postmortem, the deletion was the result of a developer inadvertently running a maintenance process against the production environment. The report noted that the developer was “one of a very small number of developers who have access to this production environment.” It also noted that the developer’s access was extraordinary, in place only on an interim basis to allow for the execution of operational processes that are not yet automated.

In examining this outage for lessons learned, there are two different “layers” to explore. First, there are the root cause issues and areas for improvement for AWS as a service provider. Separately, Netflix and other cloud services customers have their own set of lessons for architecting systems that leverage service provider infrastructure.

Starting with AWS, lesson one is a full recognition of the almighty power of human error. Despite the best recruiting, training and coaching, given enough opportunities, people will make errors. The number one principle of mistake proofing is to prevent these errors from actually impacting a customer-facing service. And while AWS recognized the need for specific changes in their operational processes, true mistake proofing happens at a deeper, more foundational level within an organization. The heart of mistake proofing is having a set of “first principles” embedded in the culture of the organization. Only by incorporating these principles, and the processes that flow from them, can an organization defend against a broad variety of errors. By simply responding to the nuances of a particular incident, an organization creates process improvements that are too narrow in scope.

Many service issues are enabled by architectures that don’t fully segregate production and non-production (e.g., development) environments. There are two basic classes of problems that emanate from this deficiency. First, as I believe happened in the Amazon case, an individual thinks they are performing an operation against the development instance when in fact they are impacting production. While the Amazon case involved the accidental deletion of essential data, equally disastrous outcomes can result from the inadvertent rebooting of a production server or recycling of a production process. An equally devastating consequence of non-segregated environments occurs in the opposite direction: a production operation or service is inadvertently directed at a non-production environment. A painful example is a production application that mistakenly points back to a non-production database. If not caught quickly, this can lead to a host of grave data integrity and customer service issues.
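
To make that second failure mode concrete, below is a minimal sketch in Python of a startup check that refuses to let an application in one environment connect to a database belonging to the other. The environment variable names (APP_ENV, DB_HOST) and the host-naming convention are my own assumptions for illustration, not a description of Amazon’s or Netflix’s setup:

```python
import os
import sys

# Hypothetical naming convention: production database hosts carry a "prod-"
# prefix, non-production hosts a "dev-" prefix. Adjust to your own standards.
PROD_DB_PREFIX = "prod-"


def check_db_matches_environment(app_env: str, db_host: str) -> None:
    """Abort startup if the configured database does not belong to the
    environment this application instance claims to run in."""
    if app_env == "production" and not db_host.startswith(PROD_DB_PREFIX):
        sys.exit(f"FATAL: production app configured with non-production DB '{db_host}'")
    if app_env != "production" and db_host.startswith(PROD_DB_PREFIX):
        sys.exit(f"FATAL: {app_env} app configured with production DB '{db_host}'")


if __name__ == "__main__":
    # Both values would normally come from deployment configuration.
    check_db_matches_environment(
        app_env=os.environ.get("APP_ENV", "development"),
        db_host=os.environ.get("DB_HOST", "dev-db01.example.internal"),
    )
    print("Environment/database pairing looks consistent; continuing startup.")
```

A check this simple obviously doesn’t replace true architectural segregation, but it turns a silent misconfiguration into a loud, immediate failure.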

The technical steps that would ensure adequate segregation between production and non-production environments are complex, architecture specific, and beyond the scope of this blog post. At a minimum, though, one would want separate ID sets, authentication services and scheduling services. Firewalling the two environments is another sound practice, and dedicated proxy services should be used for the transfer of data between them. The goal should be to absolutely prevent any process operating within one environment from inadvertently accessing the resources of the other.
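
As one illustration of that goal, here is a minimal sketch of a guard that a maintenance script could run before touching anything. The hostname convention, environment names and confirmation prompt are assumptions made for the sake of the example, not Amazon’s actual practice:

```python
import os
import socket
import sys


def guard_environment(intended_env: str) -> None:
    """Refuse to proceed unless the host we are running on matches the
    environment the operator says they intend to modify."""
    # Hypothetical convention: hostnames are prefixed with their environment,
    # e.g. "prod-elb-admin01" versus "dev-elb-admin01".
    actual_env = "production" if socket.gethostname().startswith("prod-") else "development"

    if actual_env != intended_env:
        sys.exit(f"ABORT: running on a {actual_env} host but the target is {intended_env}")

    if actual_env == "production":
        # Require a deliberate, typed acknowledgement for production work.
        token = input("Type 'PRODUCTION' to confirm this change: ")
        if token != "PRODUCTION":
            sys.exit("ABORT: production change not confirmed")


if __name__ == "__main__":
    guard_environment(intended_env=os.environ.get("TARGET_ENV", "development"))
    print("Environment check passed; safe to run the maintenance routine.")
```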

Another issue mentioned in the postmortem is that the developer’s access to production was temporary while maintenance tasks were being automated. This is a common situation that frequently leads to trouble down the road. As organizations (service providers and enterprises alike) implement new platforms, applications or features in a production environment, they are often time-pressured. Whether it is market pressure or heat from internal customers, they feel an urgency to get a new capability up and running. Often, to accomplish this expediently, certain shortcuts are taken. A variety of important deliverables associated with the new capability are positioned as “Day 2” items. That is, they will be completed at some point after the initial go-live date.

In addition to the automation of maintenance routines, other typical “Day 2” deferrals include monitoring, backup, business continuity and documentation. Organizations such as start-ups, which have minimal existing customer bases and are looking for first-mover advantage, can understandably make these compromises. They can afford to sacrifice availability for feature richness and innovation. But established organizations such as Amazon, which provide services to large numbers of big, mission-critical customers, should not be taking these shortcuts. The same is true for enterprises providing high value services to their internal and external customers. The advantages provided by the early release of these new capabilities are vastly outweighed by the negative impact of outages. High availability environments should have all proper operational controls in place prior to a production go-live event.
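
One way to enforce that discipline is to treat operational readiness as a gate rather than a wish list. The sketch below, with control names that are purely illustrative, blocks a go-live unless every required control is confirmed:

```python
# A minimal go-live gate: every operational control must be confirmed
# before a release to a high-availability environment proceeds.
REQUIRED_CONTROLS = [
    "monitoring",
    "backup",
    "business_continuity",
    "documentation",
    "maintenance_automation",
]


def missing_controls(controls: dict) -> list:
    """Return the list of required controls that are not yet in place."""
    return [name for name in REQUIRED_CONTROLS if not controls.get(name, False)]


if __name__ == "__main__":
    # In practice these flags would come from a release checklist or CMDB.
    status = {
        "monitoring": True,
        "backup": True,
        "business_continuity": False,     # deferred to "Day 2"
        "documentation": True,
        "maintenance_automation": False,  # still a manual runbook
    }
    missing = missing_controls(status)
    if missing:
        raise SystemExit(f"Go-live blocked; missing controls: {', '.join(missing)}")
    print("All operational controls in place; go-live approved.")
```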

As an aside, another insidious problem with these Day 2 deferrals is that their remnants tend to remain in place indefinitely, creating ongoing risk. Organizations frequently lack the processes to ensure that an effective “clean up” is accomplished and the risk eliminated. Typically, an organization gets distracted by other priorities after the go-live event, leaving the deferred activities on a perpetual “back burner”. If the organization is “lucky”, the issue is discovered as part of an audit. Otherwise it is uncovered only when an aberrant event stumbles upon the exposure, leading to an outage, sometimes years down the road.

Another issue highlighted by this outage is the apparent complexity of designing fully resilient applications on top of AWS’ infrastructure. I would expect that Netflix, and other impacted customers, could have implemented their applications in such a way that the outage would not have taken down their service (or at least would have allowed it to run in a degraded fashion). However, this typically requires a deep understanding of an extensive set of resources, along with a full recognition of the complex interrelationships of components. This is not easily done by customers. A better approach would be for AWS to create pre-packaged offerings with different levels of resiliency (e.g., Platinum, Gold, Silver) and associated service levels. Such an offering should be easy for a customer to leverage, without requiring a detailed understanding of the underlying architecture.
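
To illustrate what such an offering might look like, here is a sketch of how named tiers could expand into concrete redundancy settings on the customer’s behalf. The tier names and numbers are purely hypothetical; to my knowledge AWS offers no such packaged product:

```python
# Hypothetical mapping from a customer-facing resiliency tier to the
# redundancy settings the provider would apply behind the scenes.
RESILIENCY_TIERS = {
    "platinum": {"availability_zones": 3, "min_instances_per_zone": 2, "cross_region_failover": True},
    "gold":     {"availability_zones": 2, "min_instances_per_zone": 2, "cross_region_failover": False},
    "silver":   {"availability_zones": 1, "min_instances_per_zone": 2, "cross_region_failover": False},
}


def deployment_plan(tier: str) -> dict:
    """Expand a chosen tier into concrete redundancy settings."""
    try:
        return RESILIENCY_TIERS[tier]
    except KeyError:
        raise ValueError(f"Unknown tier '{tier}'; choose one of {sorted(RESILIENCY_TIERS)}")


if __name__ == "__main__":
    print(deployment_plan("gold"))
```

The point is that the customer chooses a service level, not an architecture; the provider translates that choice into the right topology.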

Now let’s look at the customer side of this issue. Leveraging service providers such as AWS for “on demand” infrastructure is becoming a more common practice. As cloud-based services continue to mature, firms such as Netflix see less and less benefit in investing in their own dedicated data centers, network infrastructure and server farms. All of this, of course, comes with tradeoffs and cautions. To appropriately leverage a service provider’s capabilities, a client firm must understand the architectural limitations of the services. While I’m not aware that AWS has created the simplified service tiers suggested above, they do have suggested design practices. As an example, they use the concept of “availability zones” to enable clients to implement greater resiliency within their systems.
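
To illustrate the underlying idea, the sketch below spreads a fleet of instances across zones in round-robin fashion so that losing any single zone removes only a fraction of total capacity. The zone names and instance identifiers are placeholders, and this is not a call into any real AWS API:

```python
from itertools import cycle

# Placeholder zone identifiers; a real deployment would use the provider's
# actual availability zone names for the chosen region.
AVAILABILITY_ZONES = ["zone-a", "zone-b", "zone-c"]


def spread_instances(instance_ids, zones):
    """Round-robin instances across zones so the loss of any single zone
    removes at most roughly 1/len(zones) of total capacity."""
    placement = {zone: [] for zone in zones}
    for instance, zone in zip(instance_ids, cycle(zones)):
        placement[zone].append(instance)
    return placement


if __name__ == "__main__":
    instances = [f"web-{i:02d}" for i in range(1, 7)]
    for zone, members in spread_instances(instances, AVAILABILITY_ZONES).items():
        print(zone, members)
```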

It is beyond the scope of this post to outline the appropriate client implementation for each service provider and scenario. However, there is one simple rule to follow: prior to implementing any mission-critical application on service provider infrastructure, engage in a design phase with the provider. Make sure they understand your specific service requirements and that your design, matched with their infrastructure, will meet those needs.

 
