Amazon released a postmortem report on their recent high-profile outage. In it, they explain in technical detail how their Elastic Compute Cloud ("EC2") was knocked out of service for several hours, affecting numerous customers. Amazon has received significant praise for the openness and thoroughness of the report; at the same time, the incident has amped up the debate around the risks of cloud computing. I don't feel I have enough information to personally assess Amazon's handling of the incident. They operate an extremely large and complicated environment, and time will tell whether they make sufficient adjustments to their architecture and processes to avoid recurrences of this type of problem. I do believe, however, that we know enough about the outage to derive some general lessons from it.
Here are some clear facts that are revealed by the postmortem:
- A routine maintenance activity (related to growth) set the stage for the outage
- The maintenance activity was performed at 9:47 AM local time for the US East Region on a Thursday
- A step within the maintenance activity was performed incorrectly, triggering the outage
- Recovery of the system did not immediately work as anticipated
- Once the system got into an errant state, a cascading set of additional problems occurred
It is notable that despite Amazon's size, sophistication and advanced technology, the storyline of this outage is as old as enterprise technology itself. Maintenance activities have long been the largest driver of outages, and human error is the predominant reason that maintenance activities go awry. Given this history, I will analyze the Amazon outage from the standpoint of change management practices.
The first thing to examine about this activity is the scheduling of the change. The change took effect at the start of a business day for the east coast of the US. The central tenet of change management practice is to schedule activities when they can do the least harm if they are unsuccessful. When manual human effort is involved, there must always be an assumption that there could be an error in the process. Furthermore, it must be expected that the error may lead to customer-facing outages of the service. Historically, enterprises scheduled changes during periods of low utilization or outside of their normal "business hours". For many firms this traditionally meant Saturday night into Sunday morning in the local time zone of their primary customer base. In Amazon's case, scheduling may have been complicated by the very large, diverse community that they serve. Most companies, however, can still move change activity to lower-impact windows where an outage would be more tolerable to customers.
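The scheduling discipline described above can be made concrete as a simple gate in change automation. This is an illustrative sketch only; the window boundaries (Saturday 10 PM through Sunday 6 AM) and the `America/New_York` time zone are my assumptions, not anything stated in Amazon's report:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumed low-utilization window: Saturday 22:00 through Sunday 06:00,
# in the primary customer base's local time zone (assumed US East here).
PRIMARY_TZ = ZoneInfo("America/New_York")

def in_maintenance_window(now: datetime) -> bool:
    """Return True if `now` falls inside the approved change window."""
    local = now.astimezone(PRIMARY_TZ)
    saturday_night = local.weekday() == 5 and local.hour >= 22
    sunday_morning = local.weekday() == 6 and local.hour < 6
    return saturday_night or sunday_morning

# A change gate would refuse to proceed outside the window. A Thursday
# 9:47 AM change, like the one in the report, would be rejected:
thursday_morning = datetime(2011, 4, 21, 9, 47, tzinfo=PRIMARY_TZ)
print(in_maintenance_window(thursday_morning))  # False
```

In practice such a gate would sit in the change-approval tooling rather than in the maintenance script itself, so that exceptions require an explicit, audited override rather than an operator's judgment in the moment.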
The next thing to consider about the maintenance activity is the manual, human effort involved. The report indicates that an error was made when traffic was incorrectly shifted to a lower-capacity network. While the details are not clearly spelled out, one could assume that this action was performed by a technician using a command line or a graphical management system. In either case, it was only a matter of time (i.e. enough change attempts) before an operator would make this, or a similar, error. Basic mistake-proofing doctrine would minimize direct human involvement in maintenance activities. Instead, the following best practices should be followed, in priority order:
- Where possible, the maintenance should be incorporated within management software specifically designed to perform the operation
- If that’s not possible, the activity should be scripted, with proper dependency and error checking included
- If human interaction is still required, the operator should follow a checklist, with forcing functions employed to ensure proper sequencing and execution
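As a sketch of the second and third practices, the fragment below shows a scripted traffic shift guarded by a capacity precondition, which acts as a forcing function: the operation cannot proceed against a target that lacks the capacity to carry the load. The function name, the capacity figures, and the data shapes are all hypothetical; Amazon's report does not describe their tooling at this level:

```python
# Hypothetical scripted maintenance step with a built-in forcing function.
# The script, not the operator, verifies that the destination network can
# absorb the traffic before any routing change is made.

def shift_traffic(target: dict, expected_load_gbps: float) -> None:
    """Shift traffic onto `target`, refusing if capacity is insufficient."""
    if target["capacity_gbps"] < expected_load_gbps:
        raise RuntimeError(
            f"refusing shift: '{target['name']}' offers "
            f"{target['capacity_gbps']} Gbps, need {expected_load_gbps}"
        )
    target["active"] = True  # stand-in for the real routing change

primary = {"name": "primary", "capacity_gbps": 100, "active": False}
secondary = {"name": "secondary-low-capacity", "capacity_gbps": 10, "active": False}

shift_traffic(primary, expected_load_gbps=80)  # succeeds
try:
    # The Amazon-style mistake: shifting onto a lower-capacity network.
    shift_traffic(secondary, expected_load_gbps=80)
except RuntimeError as err:
    print("blocked:", err)
```

The point is not the specific check but the pattern: every precondition an expert operator would verify mentally should be encoded in the tooling, so a moment of inattention cannot complete an invalid operation.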
The last part of the incident to consider is the unexpected recovery issues and cascading failures that resulted from the mistaken operation. While not every scenario can be fully tested, firms should do their best to simulate systems-recovery processes under stress. A contributing issue in the Amazon incident was that the systems were attempting to recover under high load. Once again, scheduling the maintenance activity for a low-utilization timeframe would help by putting less stress on recovery processes.
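A toy queueing argument illustrates why load matters so much here. Treat recovery as draining a backlog of repair work while new requests keep arriving and competing for the same capacity; all numbers below are invented for illustration and have no connection to Amazon's actual figures:

```python
# Toy model: time to drain a recovery backlog shrinks with spare capacity.

def recovery_time(backlog: int, capacity: int, incoming_rate: int) -> float:
    """Steps to drain `backlog` units when `incoming_rate` units keep arriving
    against a total processing `capacity` per step."""
    if incoming_rate >= capacity:
        return float("inf")  # recovery never completes under saturation
    return backlog / (capacity - incoming_rate)

# Same backlog, same capacity; only the ambient load differs:
print(recovery_time(backlog=1000, capacity=100, incoming_rate=10))  # off-hours
print(recovery_time(backlog=1000, capacity=100, incoming_rate=95))  # peak load
```

With 90 units of spare capacity the backlog drains in roughly 11 steps; with 5 it takes 200, and at full saturation it never drains at all. That nonlinearity is why a recovery process that performs acceptably in an idle test environment can stall badly during business hours.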
The current business climate mandates constant improvements, continuous growth and round-the-clock availability. These requirements put tremendous pressure on firms to remain nimble while retaining quality. Having a disciplined and mature approach to change management is an essential practice for firms that want to minimize service disruptions. The Amazon outage is a painful reminder of the potential for reputational damage when maintenance processes go awry.