Dan Bricklin's Web Site: www.bricklin.com
Learning From Accidents and a Terrorist Attack
There are principles that may be gleaned by looking at Normal Accident Theory and the 9/11 Commission Report that are helpful for software development.
In my essay "Software That Lasts 200 Years" I list some needs for Societal Infrastructure Software. I point out generally that we can learn from other areas of engineering. I want to be more explicit and to list some principles to follow and examples from which to learn. To that end, I have been looking at fields other than software development to find material that may give us some guidance.
Part of my research has taken me to the study of major accidents and catastrophic situations involving societal infrastructure. I think these areas are fertile for learning about dealing with foreseen and unforeseen situations that stress a system. We can see what helps and what doesn't. In particular, I want to address the type of situations covered in Charles Perrow's "Normal Accidents" (such as at the Three Mile Island nuclear power plant as well as airline safety and nuclear defense) and "The 9/11 Commission Report" (with regard to activities during the hijackings and rescue efforts).
Normal Accident Theory
Charles Perrow's book "Normal Accidents" was originally published in 1984 with an afterword added in 1999. It grew out of an examination of reports about accidents at nuclear power plants, initially driven by the famous major one that occurred on March 28, 1979, at the Three Mile Island nuclear plant in Pennsylvania. Perrow describes many different systems and accidents, some in great detail, including petrochemical plants, aircraft and air traffic control, marine transportation, dams, mines, spacecraft, weapons systems, and DNA research.
Perrow starts the book with a detailed description of the Three Mile Island accident, taking up 15 pages to cover it step by step. You see the reality of component failure, systems that interact in unexpected ways, and the confusion of the operators.
To help you get the flavor of what goes on during the accidents he covers, here is a summary of the Three Mile Island accident as I understand it:
Apparently a common failure of a seal caused moisture to get into the instrument air system which caused a change in pressure in the air system. The change in pressure triggered an automatic safety system on some valves to incorrectly think some pumps should be shut down, stopping the flow of water to a steam generator. That caused some turbines to automatically shut down. Stopping the turbines made an emergency pump need to come on which pumped water into pipes with valves that had accidentally been left closed during maintenance. The pipe valves had two indicators, but one was obscured by a repair tag hanging on the switch above it and they didn't check the other assuming all was well. When things started acting funny several minutes later, they checked but by then the steam generator boiled dry, causing heat not to be removed from the reactor core, which caused the control rods to drop in, stopping the reactor, but the reactor still generated enough heat to continue raising pressure in the vessel. The automatic safety device there, a relief valve, failed to reseat itself after relieving the pressure, letting core coolant be pushed out into a drain tank. The indicator for the relief valve failed, indicating that it was closed when it was not, so the draining continued for a long time without the operators knowing it was happening. Turning on some other pumps to fix a drop in pressure seemed to work for only a while, so they turned on another emergency source of water for the core, but only for a short time to avoid complications that it could cause if overused. Not knowing of the continued draining, the drop in reactor pressure didn't seem to match the increase in another gauge, and they had to choose which was giving a correct indication of what was going on. They were trying to figure out the level of coolant in the reactor (since too little would lead to a meltdown), but there were no direct measures of coolant level in this type of reactor, only indirect. 
The indicators that could indirectly help one figure out what was going on weren't behaving as the operators had been trained to expect. Some pumps started thumping and shaking and were shut down. The computer printing out status messages got far behind before they found out about the unseated valve. The alarms were ringing, but you couldn't shut down the noise without also shutting down other indicators.
The story goes on and on -- this is just the beginning of that accident.
In addition to describing the many accidents and near accidents where good luck (or lack of bad luck) kept things safe, he also tries to figure out what makes some systems less prone to major accidents than others. It is, he believes, in the overall design.
Here are some quotes (emphasis added):
The main point of the book is to see...human constructions as systems, not as collections of individuals or representatives of ideologies. ...[T]he theme has been that it is the way the parts fit together, interact, that is important. The dangerous accidents lie in the system, not in the components. [Page 351]
...[Here is t]he major thesis of this book: systems that transform potentially explosive or toxic raw materials or that exist in hostile environments appear to require designs that entail a great many interactions which are not visible and in expected production sequence. Since nothing is perfect -- neither designs, equipment, operating procedures, operators, materials, and supplies, nor the environment -- there will be failures. If the complex interactions defeat designed-in safety devices or go around them, there will be failures that are unexpected and incomprehensible. If the system is also tightly coupled, leaving little time for recovery from failure, little slack in resources or fortuitous safety devices, then the failure cannot be limited to parts or units, but will bring down subsystems or systems. These accidents then are caused initially by component failures, but become accidents rather than incidents because of the nature of the system itself; they are system accidents, and are inevitable, or "normal" for these systems. [Page 330]
This theory of two or more failures coming together in unexpected ways and defeating safety devices, cascading through coupling of sub-systems into a system failure, is "Normal Accident Theory" [pages 356-357]. The role of the design of a system comes up over and over again. The more tightly sub-systems are coupled, the more accident prone they will be. The most problematic are couplings that are not obvious to the original designers, such as physical proximity that couples sub-systems. During a failure of one system (for example, a leak), a different system (the one it drips onto) is affected, leading to an accident. (In computer systems this is very common, such as memory overruns in one area causing errors elsewhere.)
Another key point I found in the book is that in order to keep failures from growing into accidents the more an operator knows about what is happening in the system the better. Another point is that independent redundancy can be very helpful. However, back to coupling, redundancy and components that are interconnected in unexpected ways can lead to mysterious behavior, or incorrectly perceived correct behavior.
More examples from "Normal Accidents"
An example he gives of independent redundant systems providing operators much information was of the early warning systems for incoming missiles in North America at the time (early 1980's). He describes the false alarms, several every day, most of which are dismissed quickly. When an alarm comes in to a command center, a telephone conference is started with duty officers at other command centers. If it looks serious (as it does every few days), higher level officials are added to the conference call. If it still looks real, then a third level conference is started, including the president of the U.S. (which hadn't happened so far at the time). The false alarms are usually from weather or birds that look to satellites or other sensors like a missile launch. By checking with other sensors that use independent technology or inputs, such as radar, they can see the lack of confirmation. They also look to intelligence of what the Soviets are doing (though the Soviets may be reacting to similar false alarms themselves or to their surveillance of the U.S.).
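The idea of checking an alarm against independent sources before escalating can be sketched in a few lines. This is only an illustration: the sensor names, the function names, and the rule that maps confirmations to conference levels are assumptions made here. Perrow describes people exercising judgment on conference calls, not a fixed formula.

```python
from typing import Dict


def confirmations(readings: Dict[str, bool]) -> int:
    """Count how many independent sources report a launch."""
    return sum(1 for detected in readings.values() if detected)


def escalation_level(readings: Dict[str, bool]) -> int:
    """Map independent confirmations to a conference level.

    0 = dismiss, 1 = duty officers confer, 2 = higher-level officials
    are added, 3 = top-level conference. The mapping is a hypothetical
    rule for illustration only.
    """
    return min(confirmations(readings), 3)


# A satellite sees something, but radar and intelligence do not confirm it.
alarm = {"satellite": True, "radar": False, "intelligence": False}
level = escalation_level(alarm)
```

The point of the sketch is that the sources must be independent. If two "different" sensors share an input or a failure mode, counting them twice gives false confidence.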
In one false alarm in November of 1979, many of the monitors reported what looked exactly like a massive Soviet attack. While they were checking it out, ten tactical fighters were sent aloft and U.S. missiles were put on low-level alert. It turned out that a training tape on an auxiliary system found its way into the real system. The alarm was suspected of being false in two minutes, but was certified false after six (preparing a counter strike takes ten minutes for a submarine-launched attack). In another false alarm, test messages had a bit stuck in the 1 position due to a hardware failure, indicating 2 missiles instead of zero. There was no loopback to help detect the error.
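The stuck-bit incident can be illustrated with a small sketch of a loopback (echo) check. Everything here is hypothetical: the 8-bit channel, the function names, and the assumption that the echoed copy returns over a path not affected by the same fault. It is not a description of the actual hardware.

```python
from typing import Callable, Tuple


def stuck_bit_channel(value: int, stuck_mask: int = 0b10) -> int:
    """Hypothetical faulty link: the bits in stuck_mask always read as 1."""
    return (value | stuck_mask) & 0xFF


def send_with_loopback(value: int, channel: Callable[[int], int]) -> Tuple[int, bool]:
    """Send a value, then compare the echoed copy with what was sent.

    Returns (received_value, ok). ok is False when the echo differs,
    which tells the sender that the link altered the message.
    """
    received = channel(value)
    echoed = received  # assumption: the echo path itself is healthy
    return received, echoed == value


# A test message meaning "zero missiles" arrives as 2, and the echo reveals it.
received, ok = send_with_loopback(0, stuck_bit_channel)
print(received, ok)
```

Note that a message whose stuck bit was already 1 (for example, the value 2) would pass this check unnoticed, so a test pattern needs to exercise both states of every bit.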
The examples relating to marine accidents are some of the most surprising and instructive. It seems that ships sometimes inexplicably turn suddenly and crash into each other much more than you would think. He sees this as relating to an organizational problem along with the tendency of people to decide upon a model of what is going on and then interpret information afterwards in that light. In ships, at the time, the captain had absolute authority and the rest of the crew usually just followed orders. (This is different than in airplanes where the co-pilot is free to question the pilot and there is air traffic control watching and in 2-way radio contact.)
In one case Perrow relates that a ship captain saw a different number of lights on another ship than the first mate. They didn't compare notes about the number of lights, especially after the captain indicated he had seen a ship. The captain thought the ship was traveling in the same direction as them (two lights), while the first mate correctly thought that it was coming at them (three). Misinterpreting what he was seeing, the captain thought it was getting closer because it was a slow, small fishing vessel, not because it was big and traveling towards him. Since passing is routine, they weren't contacted by the other ship. When he got close, he steered as if he was passing, and turned into the path of the oncoming vessel, killing eleven on his boat. In another case, apparent radar equipment errors made a ship think an oncoming ship was to its left when it was really to its right. Fog came in, and mid-course maneuvers by both ships were increasingly in a feedback loop that caused a collision.
Another book about failures
To get a feeling for how common and varied failures are, and to see how some people have attempted to classify them for learning, there are books such as Trevor Kletz's "What Went Wrong? Case Histories of Process Plant Disasters" which chronicles hundreds of them. Examining many failures and classifying them for learning is very important. You don't want to just prevent an exact duplicate of a failure in the future, but rather the entire class it represents. Failures of parts and procedures are the common, normal situation. Everything working as planned is not. Safety systems are no panacea. "Better" training or people, while often helpful, won't stop it all.
Like Perrow, Kletz believes that design is critical for preventing (or minimizing the effect of) accidents. Here are some guidelines he discusses that relate to process plants:
Use processes with fewer dangerous intermediate products, and store as little dangerous product as possible. In Bhopal (where over 2,000 people were killed by a leaking chemical) "...the material that leaked was not a product or raw material but an intermediate, and while it was convenient to store it, it was not essential to do so." "What you don't have can't leak." [page 369]
Make incorrect assembly impossible, so that, for example, you can't put a pump in backwards.
Try to minimize designs that use items that are easy to damage during installation, such as expansion joints.
"Make the status of equipment clear. Thus, figure-8 plates are better than slip plates, as the position of the former is obvious at a glance, and valves with rising spindles are better than valves in which the spindle does not rise. Ball valves are [user] friendly if the handles cannot be replaced in the wrong position." [page 378]
"Use equipment that can tolerate a degree of misuse." [page 378]
It is crucial that reports of encountered problems be made available to others for learning, especially those problems that result in accidents. There are many reasons for this. Kletz lists a few (on page 396) including our moral responsibility to prevent accidents if we can, and the fact that accident reports seem to have a greater impact on those reading them than just reciting principles.
The 9/11 Commission Report -- a story of reaction to a forced change in a system
After reading "Normal Accidents", and with its lessons in mind, I read the sections in "The 9/11 Commission Report" that relate to the events during the hijacking of the planes and at the World Trade Center until the second tower collapsed. I was looking to learn about how the "system" responded once a failure (the hijacking and the buildings being struck) started. I was looking most toward finding descriptions of communications, decision making, and the role of the general populace because of my interest in those areas. I looked for uses of redundancy and communications and real-time information by the "operators" (those closest to what was happening). I looked for unplanned activities. I mainly dealt with the descriptions of what happened, not with the recommendations in the report.
Why look at terrorism? It is different than normal failures.
Terrorism is an extreme form of "failure" and accident. The perpetrators and planners look for weak components in a system and try to cause unexpected failures with maximum destruction and impact. Many traditional books on engineering failure, such as Perrow's "Normal Accidents", explicitly do not tackle it.
I see terrorism (for our purposes here) as being a form of change to a working system, with often purposeful, forced close-coupling, and that has bad effects. We can learn from it about dealing with changes to a system that must be dealt with and that were not foreseen by the original designers. It is like a change in environment or a change in the system configuration.
The entire report is available online for free and in printed form for a nominal fee. I have put together excerpts that I found in the Commission Report that I think are instructive. They are on a separate page, with anchors on each excerpt so that they can be referred to. The page is "Some Excerpts From the 9/11 Commission Report".
I think it is worth reading the actual, complete chapters, but in lieu of that, anybody interested in communications or dealing with disastrous situations like this should at least read the excerpts. I found it fascinating, horrifying, sad, and very real. As an engineer, I saw information from which to learn and then build systems that will better serve the needs of society. Such systems would be helpful in many trying situations, natural and man-made, foreseen and unforeseen, and could save lives and suffering. Let us learn from a bad situation to help society in the future. As Kletz points out, as engineers and designers, it is our duty.
Some key quotes from the 9/11 Commission Report:
...[T]he passengers and flight crew [of Flight 93] began a series of calls from GTE airphones and cellular phones. These calls between family, friends, and colleagues took place until the end of the flight and provided those on the ground with firsthand accounts. They enabled the passengers to gain critical information, including the news that two aircraft had slammed into the World Trade Center. [page 12]
The defense of U.S. airspace on 9/11 was not conducted in accord with preexisting training and protocols. It was improvised by civilians who had never handled a hijacked aircraft that attempted to disappear, and by a military unprepared for the transformation of commercial aircraft into weapons of mass destruction. [page 31]
General David Wherley -- the commander of the 113th Wing [of the District of Columbia Air National Guard at Andrews Air Force Base in Maryland] -- reached out to the Secret Service after hearing secondhand reports that it wanted fighters airborne. A Secret Service agent had a phone in each ear, one connected to Wherley and the other to a fellow agent at the White House, relaying instructions that the White House agent said he was getting from the Vice President. [page 44]
We are sure that the nation owes a debt to the passengers of United 93. Their actions saved the lives of countless others, and may have saved either the Capitol or the White House from destruction. [page 45]
According to another chief present, "People watching on TV certainly had more knowledge of what was happening a hundred floors above us than we did in the lobby.... [W]ithout critical information coming in . . . it's very difficult to make informed, critical decisions[.]" [page 298]
[Quoting a report about the Pentagon disaster:] "Almost all aspects of communications continue to be problematic, from initial notification to tactical operations. Cellular telephones were of little value.... Radio channels were initially oversaturated.... Pagers seemed to be the most reliable means of notification when available and used, but most firefighters are not issued pagers." [page 315]
The "first" first responders on 9/11, as in most catastrophes, were private-sector civilians. Because 85 percent of our nation's critical infrastructure is controlled not by government but by the private sector, private-sector civilians are likely to be the first responders in any future catastrophes. [page 317]
The NYPD's 911 operators and FDNY dispatch were not adequately integrated into the emergency response... In planning for future disasters, it is important to integrate those taking 911 calls into the emergency response team and to involve them in providing up-to-date information and assistance to the public. [page 318]
The Report strongly suggests that the billions of dollars spent on military infrastructure failed to stop any of the hijacked planes from hitting their targets. It was civilians, using everyday airphones and the unreliable cellular system, together with our civilian news gathering and disseminating system, and intuition and improvisation, that probably stopped one. Courage and bravery were shown by all, civilians and official personnel.
I thought the Report, in its analysis, paid too little attention to the important role of civilians and professionals acting outside their prepared roles. There is a lack of attention to societal communications, including TV, radio, Internet, cellular (voice, GPS, cell cameras, etc.), and too much focus on those specific to officials. TV news was a crucial source for all, including the highest levels of government. While the phone network bogged down, it did provide crucial help, and civilian non-PSTN systems, such as Nextel Direct Connect, the Internet, and message-based systems did work well. Even the President suffered from a version of what we all get when traveling with wireless. "[H]e was frustrated with the poor communications that morning. He could not reach key officials, including Secretary Rumsfeld, for a period of time. The line to the White House shelter conference room -- and the Vice President -- kept cutting off." [page 40] The Vice President learned of the first crash from an assistant who told him to turn on his television, on which he then saw the second crash. [page 35]
There are other examples of the general populace being an important component of what is usually thought of as being the province of "law enforcement". The AMBER Alert system is apparently working, as is the "America's Most Wanted" TV show, both of which use the general populace as a means for information gathering in response to detailed descriptions and requests. In Israel, the general populace has been instrumental in detecting suspicious behavior and even taking action to thwart or minimize terrorist attacks. According to the 9/11 Commission Report, a civilian passenger with years of experience in Israel apparently tried unsuccessfully to stop the hijackers on AA Flight 11. The fourth hijacked plane, UA 93, was stopped by civilians. An almost-catastrophe on an airplane was thwarted by attendants and passengers on AA Flight 63 when they restrained "shoe bomber" Richard Reid, now knowing that suicide terrorism on airplanes was a possibility.
What do we learn here with respect to reaction to disasters?
Disasters like this and many forms of terrorism are characterized by unforeseen situations and the inclusion of everyday people.
There is lots of happenstance going on, some of it good and some bad.
Improvisation is used and often needed in unanticipated places. The coupling of systems may be changed by the terrorists through their actions and through the effects of those actions, changing the nature of the entire system.
Procedures help if appropriate to the situation but may not if the situation is different than anticipated.
People close to specific situations throughout the emergency need to decide what to do based upon information.
The information needed is often available somewhere in the "system", if only it could be found.
Regular people can use the information and are a crucial component (e.g., UA 93).
Official people also use the information, and can be very helpful in additional ways, especially since they have equipment and training and "authority" to lead and comfort.
A sub-optimal response comes from lack of information at the right place in the situation as it unfolds (e.g., 911 operators not having timely evacuation instructions).
People often will ask the right questions, or at least some do, and then share what they learn and decide with others.
Outsiders will join if needed and asked (and sometimes will try to join even if not asked), and will need information. For example, General Wherley, commander of the 113th Wing, "...reached out to the Secret Service after hearing secondhand reports that it wanted fighters airborne." Many first responders from non-assigned groups as well as off-duty personnel showed up in New York and at the Pentagon with both good and bad results.
Multiple modes of communications and sources of information are necessary and help. Redundancy is good.
It is hard to know in advance all of the people and groups that will need to be connected through communications.
The Secret Service, an organization whose mission involves working against the unexpected, shows up in places you wouldn't expect, such as air defense and even providing information at the World Trade Center. This shouldn't be surprising. In addition to planning and post-event analysis, they specialize in improvisation, and are trained to be "...prepared to respond to any eventuality".
Examination of the 9/11 Commission Report comes up with some of the same lessons as Perrow, namely the need for people nearest to what's happening to have access to detailed, real-time information relating to what is happening in many parts of the system that may not have been foreseen.
Here are some additional things that we learn about cases like this:
There is a need for the ability to easily improvise to deal with new situations.
The people involved in dealing with the situation now include the general populace instead of just "operators".
Some of the information is coming from more generic sources as well as those participants themselves.
The general populace, and even the "official" participants, are likely to seek information from everyday channels (TV, 911, cell phones, Internet).
What we can do
Here are some of my thoughts about reacting to catastrophes:
I see a need for a source of coordination of information from afar. Being too close to the situation may cause those coordinators that are close to lose the wide perspective and get sucked into a growing situation. Multiple, independently redundant means of information acquisition, evaluation, and dissemination are important.
We need ways for everyday people involved in a disastrous situation to both give and receive information. We need ways that they can query quickly for what they need. Telephone 911 has served some of this, but it is apparently not good in an unusual, widespread, unforeseen situation. The Internet could be used, and could serve as a resource to those 911 operators, too. In today's world, they pull out their cell phones and call loved ones who look at the TV and could serve as gateways for sending and receiving information.
The work that is going on in the blogging world in regard to seeking out, filtering, and disseminating information coming from a huge number of data sources for a diverse set of needs may be helpful for learning how to deal with such situations. Blogs, RSS, and the search engines are currently tuned for situations that unfold over many hours, days, or weeks. We need to look for principles that can be applied to minutes and seconds. The media forms of text, images, audio, and video are appropriate here, too.
We have a populace that is getting more and more comfortable with message-based forms of communications. Much of the information needed during 9/11 was relatively static: The fact that planes had been hijacked, that planes had hit buildings, where exactly they hit, that an evacuation had been called for, etc. Such information was of life and death importance. Even short text messages would have sufficed in some cases (though they shouldn't be the only means of communications, of course).
Communications devices used by official first responders (and hopefully everyday people) need to handle message-based communications better. Stored messages in voice may be the best for some users. (I don't think we want to depend upon fire fighters reading tiny screens surrounded by smoke or typing on keyboards while wearing heavy gloves.)
Communications devices need to handle degrading connectivity situations better. For example, they should be able to move from 2-way voice to real-time, during-the-call store-and-forward voice, to delayed voice (like auto-delivered voicemail), and even text delivered from speech-to-text (as text and perhaps as text-to-speech). This would handle intermittent connectivity and saturated bandwidth gracefully with little change in operation or training. Drop-outs and degradation should be flagged, perhaps with an appropriate sound indicating that there was a possible loss of information. Requests for retries should be possible, perhaps built into the system so that the sender doesn't need to repeat themselves; instead, an ongoing real-time recording would be used (much as we play back voicemail) on both the sender and receiver side. Storage requirements and processing power for such functions are well within the capabilities of multimedia-enabled handsets. User interfaces and standards need to be developed. Perhaps this is an area where new generic handsets and WiFi-enabled VoIP will take a leap forward.
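The ladder of fallbacks described above can be sketched as a simple mode chooser. This is a minimal illustration, not a design for a real handset: the class and function names, the link measurements, and every threshold are assumptions picked for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    LIVE_VOICE = "two-way voice"
    STORE_AND_FORWARD = "store-and-forward voice during the call"
    DELAYED_VOICE = "delayed voice, delivered like voicemail"
    TEXT = "text produced by speech-to-text"


@dataclass
class LinkStatus:
    connected: bool  # is there any connectivity right now?
    kbps: float      # rough estimate of usable bandwidth
    loss: float      # fraction of packets lost, 0.0 to 1.0


def choose_mode(link: LinkStatus) -> Mode:
    """Pick the richest mode the link can plausibly sustain.

    The thresholds are illustrative assumptions, not measured values.
    """
    if link.connected and link.kbps >= 24 and link.loss <= 0.05:
        return Mode.LIVE_VOICE
    if link.connected and link.kbps >= 8:
        return Mode.STORE_AND_FORWARD
    if link.connected and link.kbps >= 2:
        return Mode.DELAYED_VOICE
    return Mode.TEXT  # smallest payload; queued until the link returns


def should_flag_dropout(link: LinkStatus) -> bool:
    """Decide whether to warn the listener that information may be lost."""
    return not link.connected or link.loss > 0.05
```

Because the caller only asks for a mode, the user keeps talking the same way while the device steps down or back up the ladder, which is the "little change in operation or training" the paragraph asks for.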
The ability to switch communications seamlessly or manually between many different types of carriers, be it normal Internet connections, handheld point-to-point radios, or some sort of mesh system could be key. Rather than just pour more money into "beefing up" single agency-specific systems, find out how to take advantage of multiple systems, including civilian ones. Cell phones may have been overwhelmed, but they saved lives, people got through some of the time, and they were used by even the senior officials. Redundancy and the ability to improvise must be exploited. Packet-based communications can have advantages over circuit-based in terms of graceful degradation.
Traditionally, unusual situations are handled with "situation specific conference calls" where interested parties share an open "line" and participate or just listen in. During 9/11 there were several and the right people were not always present. We are learning about "joining" multiple simultaneous conversations with online "chat" as one style and RSS aggregators as another. We need to move this further.
I can imagine many "regular" people reacting to an emergency need for information by turning on the TV and/or radio, using their phone or cell phone, and using Google or their preferred search engine. (Many of my readers would probably check their RSS aggregators.) There should be a known way to specify to the search engines that they should return only real-time and situation-specific information. There needs to be a way for "officials" to disseminate information to such feeds. There need to be ways to establish limited-access feeds and easily give access to such feeds to appropriate people, much as "talk groups" are used on "point-to-point" radio.
We must guard against the attitude that only the authorities know best. In many cases, civilians are closest to details and may come up with the appropriate improvisation to an unforeseen situation. They must be part of the solution, and therefore must be used for getting information and must have access to it.
Summary and Next Steps
This essay covers a wide range of topics. It introduces "Normal Accident Theory", looks at some of the aspects of a major terrorist attack, and proposes some areas for design that are suggested by the results of that attack. The original goal, though, was to come up with some principles that could be applied to making software that fits with the long-term needs of society. Here are some of those principles:
Instrument the sub-systems and components so that failures can be detected and so that behavior can be monitored when there are changes. There is a need to know "what is going on".
Examine failures and share what is found with others so that there is learning.
Try to keep sub-systems loosely coupled, the interfaces understandable, and the intermediate steps comprehensible.
Allow for, and anticipate, improvisation. The design of instrumentation and the coupling of sub-systems can make improvisation easier or harder.
Those who deal with changes may not be the ones for whom the designers planned nor who were pre-trained to deal with those changes. This affects the design of instrumentation, coupling, and documentation.
Generic, "global" resources help and should be able to be used as part of instrumentation and improvisation.
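As one possible reading of the first and third principles, here is a minimal sketch of instrumenting sub-systems while keeping them loosely coupled. The Monitor class, the instrumented helper, and the step names are all hypothetical, invented for this example.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List


class Monitor:
    """Collects status events so an operator can see "what is going on"."""

    def __init__(self, keep: int = 100) -> None:
        self.events: Deque[Dict] = deque(maxlen=keep)

    def report(self, source: str, status: str, detail: str = "") -> None:
        self.events.append(
            {"time": time.time(), "source": source, "status": status, "detail": detail}
        )

    def failures(self) -> List[Dict]:
        return [e for e in self.events if e["status"] == "failed"]


def instrumented(name: str, step: Callable[[], object], monitor: Monitor):
    """Run one sub-system step and record the outcome.

    The step knows nothing about the monitor, which keeps the coupling
    loose: either side can be replaced without touching the other.
    """
    try:
        result = step()
    except Exception as exc:  # report the failure instead of hiding it
        monitor.report(name, "failed", repr(exc))
        return None
    monitor.report(name, "ok")
    return result


def broken_step():
    raise RuntimeError("input unavailable")


monitor = Monitor()
instrumented("parser", lambda: 42, monitor)
instrumented("uploader", broken_step, monitor)
print([e["source"] for e in monitor.failures()])
```

The record of recent events is deliberately generic, so someone other than the original designer can read it while improvising a fix.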
The next step will be to put these together with other principles gleaned from other areas.
-Dan Bricklin, 7 September 2004
© Copyright 1999-2010 by Daniel Bricklin
All Rights Reserved.