The philosopher George Santayana famously said, “Those who cannot remember the past are condemned to repeat it.” It’s a simple yet valuable idea. Learning from your mistakes, and the mistakes of others, can lead to better decisions and improved outcomes in your personal and professional life. One professional discipline, commercial aviation, has taken this concept further than any other field. Many other fields would benefit from applying the error reduction approaches pioneered by the commercial aviation industry.
In this post, I will examine three notable aviation disasters that led to dramatic improvements in safety. For each incident, I’ll look to apply lessons learned to my “home field” of Information Technology. In no way do I intend to trivialize these accidents. It is hard to compare an outage of an email system to the loss of life. I do believe, however, that applying some of the error reduction ideas of commercial aviation could greatly benefit Information Technology along with numerous other disciplines.
Flying on large commercial airliners has become a remarkably safe mode of transportation. For United States-based carriers, the past 10 years have been an extraordinary period of safe operation. From 2002 to 2011 there were a total of 9 fatal accidents out of nearly 102 million flights. These flights covered over 75 billion miles, making the fatal accident rate about 1 per 8.3 billion miles flown.
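As a quick check on those figures, the arithmetic can be sketched in a few lines of Python. The inputs are the rounded numbers quoted above, so the results are approximations rather than official statistics.

```python
# Rounded figures quoted in the text for U.S. carriers, 2002-2011.
fatal_accidents = 9
flights = 102_000_000          # "nearly 102 million flights"
miles_flown = 75_000_000_000   # "over 75 billion miles"

# Average exposure between fatal accidents.
miles_per_fatal_accident = miles_flown / fatal_accidents
flights_per_fatal_accident = flights / fatal_accidents

print(f"{miles_per_fatal_accident / 1e9:.1f} billion miles per fatal accident")
print(f"{flights_per_fatal_accident / 1e6:.1f} million flights per fatal accident")
```

On these inputs, that works out to roughly one fatal accident per 8.3 billion miles, or one per 11.3 million flights.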
But it hasn’t always been this way. In the early days of commercial aviation, accident rates were significantly higher. However, the dramatic impact of fatal crashes led to a culture of continuous improvement and partnership between the carriers and government regulators. Each accident was viewed as a learning experience, with an opportunity to eliminate the possibility of a similar crash. Let’s take a look at three of these notable crashes.
1956 Grand Canyon Mid-Air Collision – Sweeping Improvements in Regulation
On a June morning over the Grand Canyon, a United Airlines DC-7 struck a TWA Lockheed L-1049, causing both planes to crash into the terrain below. Both planes had taken off from Los Angeles International Airport within 3 minutes of each other. They both entered overcast skies, headed in a similar direction. Due to the lack of black boxes, radar and cockpit voice recorders, we will never fully understand the exact details of this accident. However, the immature state of air traffic control at that time certainly appears to have been a major factor.
In the mid-1950s it was common for planes to fly large sections of their journey without radar support from air traffic controllers. Pilots would rely on ground stations known as VORs, whose signals created a virtual path to follow through the sky. They would fly using visual flight rules (VFR), with a responsibility for seeing and avoiding other planes in their area. Because there was no seamless network of radar, pilots would need to periodically report their positions back to air traffic controllers.
The crash drew the scrutiny of the media, the public and Congress. It was the worst civilian air crash of this burgeoning era of modern air travel. The public demanded answers and a congressional hearing was held in 1957. The aftermath of the Grand Canyon collision led to a number of seminal reforms including the formation of the Federal Aviation Administration and the establishment of on-board crash avoidance systems.
In the world of enterprise information technology, “collisions” result from the lack of mature governance processes for projects or changes. Without seamless oversight of the multitude of project and change related activities happening within an organization, conflicts can occur, stressing resources, adding execution risk and creating outages. A mature process of project and change governance is like our modern-day air traffic control system. Flight paths are understood in advance, project and change activities “see” each other, and conflicts (crashes) are seen in advance and avoided.
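To make the air traffic control analogy concrete, here is a minimal sketch of how a shared change calendar could flag “collisions” before they happen. Everything in it is an assumption made for illustration: the `PlannedChange` record, the `find_conflicts` helper, the system names and the dates are hypothetical, and a real governance process would weigh far more than overlapping time windows.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class PlannedChange:
    """A hypothetical change record: what is touched and when."""
    name: str
    system: str
    start: datetime
    end: datetime


def find_conflicts(changes):
    """Return name pairs for changes on the same system whose windows overlap."""
    conflicts = []
    for a, b in combinations(changes, 2):
        same_system = a.system == b.system
        # Two windows overlap when each one starts before the other ends.
        overlapping = a.start < b.end and b.start < a.end
        if same_system and overlapping:
            conflicts.append((a.name, b.name))
    return conflicts


# Illustrative calendar; the entries are made up for this example.
calendar = [
    PlannedChange("Patch mail servers", "email",
                  datetime(2024, 6, 1, 22), datetime(2024, 6, 2, 2)),
    PlannedChange("Migrate mail storage", "email",
                  datetime(2024, 6, 2, 1), datetime(2024, 6, 2, 5)),
    PlannedChange("Upgrade payroll database", "payroll",
                  datetime(2024, 6, 1, 23), datetime(2024, 6, 2, 3)),
]

print(find_conflicts(calendar))
```

Here the two email changes are flagged because their windows overlap on the same system, while the payroll upgrade runs at the same time without conflict. The point is simply that the check is only possible when every planned change is filed in one place, in advance.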
1954 BOAC Flight 781 – The Science of Accident Investigations
The next time you are on a plane, notice that the windows are shaped differently from those in your home or office. Unlike the classic rectangular glass of daily life, airplane windows are a rounded rectangle, without sharp corners. The crash of British Overseas Airways Corporation (BOAC) Flight 781 directly led to this design.
On a January day in 1954, a de Havilland Comet 1 took off from Rome, heading to London. While flying at cruising altitude it exploded over the Mediterranean, killing all 35 passengers and crew members. The task of determining the cause of this crash was complex.
Initially, only bodies were recovered from the accident scene. The plane itself remained missing. Autopsies showed a consistent pattern of fractured skulls and ruptured lungs in the victims, pointing to an explosive decompression of the airplane’s cabin.
But without the wreckage from the crash, investigators were stumped in determining the root cause of the crash. In an unprecedented move, Winston Churchill ordered the Royal Navy to recover the aircraft from the sea bed. For the time, this was an extraordinarily complex endeavor. After painstakingly searching a large area around the suspected crash site, the Navy discovered pieces of the wreckage.
Meanwhile, back in London, BOAC was under significant financial pressure as its fleet of Comet jets sat idle, grounded after the fatal crash. Ten weeks after the accident, the British government made the decision to allow the grounded Comets to fly again. It would prove to be a fatal call. A mere 16 days after flights resumed, another Comet disappeared over the Mediterranean in what appeared to be similar circumstances. This time, however, the plane was lost in waters too deep for recovery of its wreckage. Fortunately, some bodies were recovered. Autopsies uncovered fractured skulls and ruptured lungs, just as with the first crash. There was now a strong sense that explosive decompression of the cabin was the cause of both crashes.
In order to validate this theory, a 1/10 scale model of the plane’s fuselage was built. Dummies were placed inside to simulate the passengers. The accident was simulated by creating pressure inside the cabin until it ruptured. Sure enough, the dummies were thrown from their seats, smashing their heads against the ceiling.
In the meantime, the investigation proceeded down two paths. First, the recovered wreckage from the first crash was painstakingly pieced together on a wooden frame. Second, an incredible experiment was performed to attempt to find any weakness in the fuselage of the Comet. A stripped-down fuselage of a Comet was delivered to the investigators. They constructed a massive water tank and placed the fuselage within it. Then, the plane was pumped full of water to simulate the pressure encountered during a flight. After 5 minutes the pressure would be reduced, and the cycle repeated. Less than a month after testing began (the equivalent of 3,000 flights), the fuselage ruptured.
Combined with evidence from the reconstructed wreckage, it was now clear that there were design flaws in the thin aluminum exterior of the plane. One of the prominent flaws was the use of square windows which were vulnerable to failure at their sharp corners.
The investigation into the crash of BOAC Flight 781 represented an unprecedented level of resources, ingenuity and persistence. It ushered in an era of scientific investigations of accidents. It pioneered techniques such as wreckage recovery and reassembly and led to a better understanding of the importance of metallurgy in aircraft manufacture.
Often in the world of information technology, we are confronted with repeat problems, without a good sense of a root cause. As with the Comet investigation, it is sometimes necessary to use clever and resourceful methods to understand the causal factors. Examples of these techniques include:
- Insertion of debugging code
- Turning on enhanced diagnostics
- Simulating conditions in non-production environments
- Swapping components
Having a formalized set of practices around incident and problem management can give an organization an improved chance of rapidly determining root causes and preventing their recurrence.
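As one small illustration of “turning on enhanced diagnostics,” the sketch below uses Python’s standard `logging` module to raise the detail level for a single component only while a problem is being investigated. The `enhanced_diagnostics` helper, the `mail.delivery` logger name and the `deliver` function are hypothetical names chosen for this example, not part of any particular product.

```python
import logging
from contextlib import contextmanager


@contextmanager
def enhanced_diagnostics(logger_name, level=logging.DEBUG):
    """Temporarily lower one logger's threshold, then restore the old level."""
    logger = logging.getLogger(logger_name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.setLevel(previous)


# Normal operation: only warnings and errors are reported.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("mail.delivery")  # hypothetical component name


def deliver(message_id):
    """Stand-in for the operation under investigation."""
    log.debug("delivering %s", message_id)  # suppressed in normal operation
    return True


deliver("msg-1")  # no debug output
with enhanced_diagnostics("mail.delivery"):
    deliver("msg-2")  # debug line is emitted while diagnostics are on
```

Restoring the previous level in the `finally` clause matters: diagnostics that are left on after the investigation can bury the next problem in noise.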
1974 Eastern Flight 212 – Sterile Cockpit Rule
In September of 1974, an Eastern Air Lines DC-9 attempted to land in dense fog on approach to Douglas Municipal Airport in Charlotte, North Carolina. The plane crashed approximately 3 miles short of the runway, killing 71 of its occupants.
An investigation conducted by the National Transportation Safety Board (NTSB) focused on conversations of the pilot and first officer recorded by the Cockpit Voice Recorder. The NTSB determined that the flight crew “engaged in conversations not pertinent to the operation of the aircraft” during the final two and a half minutes of the flight prior to impact.
During that time, the crew should have been focused on checklists, callouts (verbal readings of current altitude) and final landing procedures. Instead, they engaged in conversations ranging from politics to used cars to an amusement park they were flying past. During this critical phase, they failed to respond to low altitude alarms and seemed to be unaware of their dangerously low altitude.
As a result of the investigation, the Federal Aviation Administration (FAA) issued a mandate known as the Sterile Cockpit Rule. This rule requires flight crews to limit their conversations and activities to critical flight-related activities any time the aircraft is below 10,000 feet.
The health care industry has attempted to emulate the Sterile Cockpit Rule for some of its critical tasks. One area where this concept has been applied involves the preparation of medications in Intensive Care Units (ICUs). Much like the critical runway approach phase of a flight, the accuracy of medication preparation can be a life and death issue. In one 2005 study, 78% of all serious adverse ICU events could be traced to medication preparation errors. Distractions during the preparation of medications are deemed to be a leading cause of these errors.
In one study, an ICU attempted to cut the rate of distraction by emulating the sterile cockpit rule. Areas where medication was prepared were deemed No Interruption Zones (NIZ), marked by red tape on the floor. All ICU personnel were trained to avoid engaging any nursing personnel when they were inside the NIZ. The result was a dramatic reduction of interruption of staff when they were engaged in medication preparation activities.
In the information technology field, there are numerous times when staff are performing critical operations such as crucial maintenance or repair/recovery operations. As with airplane landings or medication preparation, these tasks can be complicated, requiring focus and concentration. Staff could be trained to avoid any extraneous dialogue or activity when engaged in these tasks. As with the NIZs used in ICUs, a designated quiet area could be used for high-risk, complex activities.
A disaster, or a simple problem, is an opportunity for an organization to learn and to make valuable process improvements. The commercial aviation industry has taken this concept to great maturity with a simple mantra: never allow a repeat occurrence of a serious event. Information technology organizations, and other knowledge-based disciplines, would do well to emulate this idea.