Normal Accidents
Charles Perrow
This is a book about complexity and how it leads to accidents. It was written in 1984, soon after the Three Mile Island nuclear plant accident, and an additional section was added in 1999. I became aware of this book in the week after the fire at Notre Dame and the second crash of a Boeing 737 Max 8. An article ("When making things better only makes them worse") outlined how this book described the inevitability of such accidents.
Perrow starts with an everyday example of an accident. You are preparing for a job interview this morning. Your partner left the coffee pot on the stove (remember, this is 1984) and the pot cracked. You get out the old percolator and prepare coffee, which puts you a bit behind schedule. You rush out of the house to find that you've locked your car and house keys inside. Last week you lent your spare key to a friend to drop something off when you'd be away. You go to borrow a neighbor's car, but it is in the shop for a repair. As a last resort, you start for the bus stop only to learn that the drivers went on strike this morning. Despite building in a redundancy (the spare key) and having backup systems (the neighbor's car and the bus), you miss the interview. The odds against all of these things going wrong seem outlandish, and there is no sense that one misadventure caused any of the others. Arguably, the broken coffee pot or the forgotten keys could be classified as human error, but the failures of the redundancies and backup components involved no human error. We tend to forget just how complex our ordinary lives are and how interdependent our activities are. This complexity and interdependence make accidents of this sort inevitable.
Perrow’s field of study was sociology and organizations – not technology. For him, the big questions were organizational. How should organizations operate to deal with different kinds of accident potentials? What are the conditions under which people must make decisions that result in accidents being caused or prevented?
At the heart of Perrow's analysis is a grid with two dimensions. One dimension is linearity-complexity. A linear system is one in which each step leads to a single subsequent step: A leads only to B. A complex system involves steps that lead to many steps, or steps that diverge or converge: A leads to B and C, and sometimes D, while D also depends on E. Complex systems sometimes circle back on themselves (think about how money moves through an economy). A feature of complex systems is that they carry a degree of unpredictability. Assembly lines and rail systems are generally linear, but most chemical plants are complex. The second dimension is coupling. A loosely coupled system has a degree of slack in it, while a tightly coupled system does not. A loosely coupled system might be quite confusing or ambiguous in its reaction to events. In general, schools and the economy are loosely coupled systems, but chemical plants are tightly coupled. The book illustrates this as a two-by-two grid; a rough sketch appears below. Some manufacturing systems combine linear and complex process steps, and tightly and loosely coupled steps. Even deciding what kind of system is in place can be difficult.
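As a rough sketch of that grid (the book's own figure is not reproduced here), the examples from the paragraph above can be laid out in code. Where the text names only one dimension for an example, the placement on the other dimension is my guess.

```python
# A minimal sketch of the linearity/coupling grid using examples from the
# summary above. Placements marked "guess" are not stated in the text.
grid = {
    ("linear", "tight"):  ["rail system"],                  # coupling is a guess
    ("linear", "loose"):  ["assembly line"],                # coupling is a guess
    ("complex", "tight"): ["chemical plant"],               # both stated above
    ("complex", "loose"): ["school", "national economy"],   # linearity is a guess
}

for (interaction, coupling), examples in grid.items():
    print(f"{interaction:>7}/{coupling:<5}: {', '.join(examples)}")
```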
Systems are composed of sub-systems, units, and parts. Accidents in loose, linear systems often occur when parts fail. A part failure can cause a unit failure, but thanks to linearity such failures are easy to understand, and thanks to loose coupling their impact is limited. When systems are tightly coupled and complex, things are different. The complexity makes changes in the system hard to understand. Are things actually normal or are they deviant? What is the cause and what is the effect? Accidents in complex systems almost always involve an incomprehensible element that confuses people. Because the systems are tightly coupled, changes in one part rapidly spread to other parts. To be a bit more exact, the interactions between parts mean that things can spiral out of control more quickly. In this terminology, part or unit failures are incidents and are generally fairly inconsequential. Complexity and coupling allow incidents to be magnified into accidents. In this set of descriptions, incidents may have a degree of randomness, but accidents do not. Equipment fails or people follow the wrong procedure (people can be a part in this sense), which leads to incidents, but thanks to loose coupling or linearity the effects remain minor. In tight, complex systems, the same incident can become a major accident. The book describes a number of accidents in various circumstances to illustrate how seemingly small incidents (a safety tag blocking an indicator light) can be magnified into big accidents.
The term "coupling" does not mean physically connected. The coffee pot-interview failure did not involve any physical connection; the individual incidents were coupled only by their impact on arriving on time. The book cites the example of a major accident in Texas City, Texas in 1947. An arriving ship carrying fertilizer (which was quite explosive) caught fire. The ship's crew failed to put out the fire and the fire department was called in; the ship exploded despite their efforts. Debris was thrown three miles into the air and two airplanes were incinerated. Nearby fuel tanks caught fire and exploded, as did an adjacent chemical plant. Another ship containing fertilizer tried to back out of the docks and rammed a third ship, and the two could not be towed out of the harbor (perhaps nobody tried). By the next day about one-third of the city was on fire, and the two other ships also exploded, igniting a sulfur plant and causing additional damage. About 560 people died and 3,000 were injured. Proximity in time and space coupled these events and turned whatever incident started the first fire into a major accident with high human and economic cost.
Perrow also describes four groups of accident victims. First-party victims are the people actually working in the system during the accident, for example the pilots of a plane that crashes. Second-party victims are those connected to the accident but not directly harmed; for example, if a plane crash causes the airline to go bankrupt, the office employees would be second-party victims. The passengers on the plane are third-party victims. Fourth-party victims are those in the future who will be victims as a consequence of the accident. Imagine that the plane was carrying radioactive materials that were released into the environment; a child born with a birth defect as a consequence of their mother's exposure would be a fourth-party victim. This is important because it is clear that organizations do not consider each of these groups of victims similarly and may not be aware of some of them at all.
Most of the examples in the book come from the United States, and most are associated with for-profit companies, but the book makes it very clear that government organizations and business organizations have the same problems. Similarly, free market and planned economies have the same problems. Essentially all organizations are under pressure to produce something, and the organization's goals and management are directed at producing the maximum amount of that something for the minimum cost (in time, money, resources, etc.). Whether the government is building a dam or a company is running a petrochemical plant, there is a drive to produce. Thus people in the organization may have occasion to take risks to ensure production (keep to schedule, keep costs down, meet quotas, etc.) that can lead to accidents. In a very important sense, the question is not about risk but about who has the power to decide to put different victims at risk. This is especially important when the benefits of success are gained by different people than those who are placed at risk.
Our view of risk is influenced by context. Production pressure is part of our context, and our training is part of it too. But one of the bigger contributors to our context is our sense of control. We are much happier driving our car, where we think we have control, than flying in a plane, where we know we have none. One effect of training is to increase our sense of control, and this permits us to take risks. This is desirable in a practical sense because there is a range of risks that trained people must run to be effective; you must be trained to drive a car, an essential task for most adults. Training tends to be very effective at preparing for the risk of part or unit failure. Where training becomes difficult is in explaining interactions between systems. In other words, it is much harder to train people on the consequences of a unit failure for a sub-system than it is to train people on the consequences of a part failure for a unit. Thus our training in driving does not cover the consequences of specific situations where interactions become more important. A risk that is totally sensible with respect to a part might not be sensible in terms of the system. In fact, training might disable your ability to diagnose a situation because it guides you to see the problem in terms of parts or units. Your expertise may limit your vision.
Organizations must decide how to manage situations with accident potential, and the options boil down to centralization or decentralization. Centralization can take a few different forms. Automation, for example, is actually centralization, in that a designer makes choices that are encoded into equipment, which then reacts to the situation. Centralization also takes the form of intensive training and high-discipline systems. Ships have traditionally been centrally managed by the captain, with little direct input from shore; this is another form of centralization. The book maps centralization and decentralization onto the coupling-complexity grid (a rough sketch appears after the next two paragraphs). In general, tight coupling calls for centralization and complexity calls for decentralization.
In linear-tight systems, centralization is the best choice: responses to changes are well understood (thanks to linearity), and tight coupling rewards a prompt, centrally directed response. A system like this can be automated. In complex-loose systems, decentralized control is preferred. There is a good chance of unexpected outcomes, and changes may be slow to become apparent or may take unexpected directions. This makes such systems hard to automate and requires the attention of people directly experiencing the situation.
Linear-loose systems can be managed with either approach with equally good effect; the choice is often dictated by organizational tradition. Complex-tight systems present the greatest difficulty. The complexity suggests decentralization, while the tight coupling suggests centralization. In one sense, this combination is incompatible with good control. In practice, organizations probably choose the dimension of greater concern, organize around that dimension, and try to build some sort of layered defense against accidents in the other dimension. For example, engineers may try to decrease coupling in the system by introducing buffers or permitting greater process variation.
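The argument of the last three paragraphs can be condensed into a short sketch. This is my summary of the mapping described above, not a figure from the book; the quadrant labels are simply the ones used in this summary.

```python
# A minimal sketch of the control-strategy argument summarized above.
# "either" means the text says both approaches work about equally well;
# the complex-tight quadrant is where the two dimensions conflict.
def recommended_control(interaction: str, coupling: str) -> str:
    if interaction == "linear" and coupling == "tight":
        return "centralize (predictable responses, fast propagation)"
    if interaction == "complex" and coupling == "loose":
        return "decentralize (local judgment needed, time to adapt)"
    if interaction == "linear" and coupling == "loose":
        return "either (often decided by organizational tradition)"
    return "conflict: complexity wants decentralization, tight coupling wants centralization"

for i in ("linear", "complex"):
    for c in ("tight", "loose"):
        print(f"{i:>7}/{c:<5} -> {recommended_control(i, c)}")
```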
Based on this perspective, the choice of approach to accident prevention within an organization depends on the situation. Applying a single approach across a diverse set of circumstances can lead to many undesirable outcomes. We sometimes see accidents lead to inappropriate policies because the policies are applied unthinkingly to all situations.
Perrow dismisses greed, capitalism, technology, and operator error as the sources of these systemic accident risks. Instead, he focuses on "externalities". An externality is a social cost of an activity that is not reflected in its price. For example, today a company that burns coal does not pay for the possible future damage done by a higher concentration of carbon dioxide in the atmosphere. A poorly designed dam that collapses exports the risk of lost lives to those living downstream. When regulators gloss over the design deficiencies, they also export risk. When an organization can avoid an expense by "exporting" the risk to others, this is an externality. Perrow observes that most incidents do not expand into serious accidents; perhaps a few people are hurt or some property is damaged. Economic and political systems may view the expense of compensation as lower than the cost of prevention. The whole point of insurance is to socialize risk, and many risks are exported to taxpayers and citizens (think of disaster relief to hurricane-prone areas, crop insurance, or malpractice insurance). Most conflict between business and politicians about regulatory burden is really an argument about who should pay for the consequences of risks taken and when those costs should be paid (before or after a possible accident).
Indirectly, a significant topic of this book is the question, "How much safety is enough?" Perrow notes that complex systems cannot be defined well enough to ensure complete safety. Damage from more linear or more loosely coupled systems might be more easily confined, but systems with greater catastrophic potential present a different kind of problem. Because of the uncertainties associated with complex systems, we may be unable to understand all of the consequences of an incident. The system may be more tightly coupled than we realize, so some incidents may become very significant. To put it more bluntly, the analysis is almost a question of imagination rather than rational probability. An organization will want to maintain production and will argue that risks are well managed. An "intervener" group will describe a worst-case scenario and point to the company's history of safety incidents. Associated with the question above is a second question, "Who decides when there is enough safety?"
Research has also shown that the public expects large organizations to be responsible for their actions and the harm they have done. Support for regulation may be the result of large organizations failing to act in the public interest. Interest or advocacy groups are sometimes needed to highlight examples where organizations get the balance wrong, in order to ensure public pressure for regulation.
The book addresses some underlying problems with risk assessment as commonly practiced. The risk associated with an accident is the combination of its probability and its impact. Automobile accidents are common but individually low impact; a plane crash is rare but high impact. Operations that have been in use for a long time create a substantial history, which allows a sound understanding of potential problems and solutions and makes assessment of risks and their mitigation relatively practical. Airplanes have been flying for a long time, with thousands of take-offs and landings each day, so there is ample evidence to assess risks associated with plane design and flight operations. In contrast, there were relatively few flights of the space shuttle before the Challenger explosion. Poor risk assessment and a small base of experience supported a bad decision to launch. The impact of the Challenger explosion, however, was low, with a small number of resulting deaths; in fact, the total loss of life in the world's space programs is less than the loss of life in a single typical commercial airplane crash. When considering a "new" risk, the frequency of use and the impact of failure must both be considered, along with how an accident would affect different types of victims. This is a key part of how a risk assessment is framed. Which victims are considered? What scale of impacts? Are the system parts assessed independently, or are system interactions in scope? Frequent small incidents might have more impact on third- and fourth-party victims than on first- and second-party victims, while a big accident might have more impact on first-party victims. Is the complexity of the system recognized? The catastrophic potential of some types of accidents leads Perrow to suggest that some systems are too complex, too tightly coupled, and too impactful to be permitted at all, while others are probably overregulated.
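A small worked example of the probability-times-impact framing may help. The numbers below are invented purely for illustration; they are not from the book.

```python
# A minimal sketch of risk as probability x impact (expected loss), with
# made-up numbers. It shows how a frequent, low-impact failure mode and a
# rare, high-impact one can carry the same expected loss while feeling
# very different and harming very different groups of victims.
scenarios = {
    "frequent, small incident":    {"annual_probability": 0.10, "impact_cost": 1e5},
    "rare, catastrophic accident": {"annual_probability": 1e-4, "impact_cost": 1e8},
}

for name, s in scenarios.items():
    expected_loss = s["annual_probability"] * s["impact_cost"]
    print(f"{name:30s} expected annual loss ~ ${expected_loss:,.0f}")
```

The point the book presses is that identical expected values can hide very different catastrophic potential and very different distributions of victims, which is exactly what the framing questions above are meant to surface.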
In the afterword to the 1999 edition, Perrow returns to the importance of power. While a risk assessment may be technically sound, the real issue is who is accepting the risk and who will suffer the consequences of incidents and accidents. What is the role of "interests" on decision makers? Perrow cites work on intra-organizational interest groups demonstrating that even with an explicit safety policy and process in place, there were groups of people whose interests ran counter to the policy (think about how we speed because we are in a hurry, think the road conditions are fine, and assume we are unlikely to be caught because everyone else is speeding too). This must be part of the risk assessment too, but often is not. It is not enough to acknowledge that operator error is a problem for safe operations; it may be critical to assume that operator error is inevitable and consider how to mitigate the consequences of that inevitable event. This is why the question becomes one of power rather than risk. For some systems, accidents are inevitable. Who gets to decide whether they are worth taking?
Observation and Commentary:
- As mentioned at the beginning of the summary, I learned of this book from an article about the Notre Dame fire. I learned that Notre Dame had a modern fire detection system installed and an on-site fire fighter and security person. The system misidentified the location of the fire and by the time the right location was found, the fire had spread. The book has numerous examples of safety measures that turned out to have a role in an incident turning into an accident.
- As also mentioned, an addendum to the book was written 20 years ago; the book is old. Yet Perrow's analysis points to accidents that would appear in following years. He fairly well describes the situation that led to the global financial problems in 2007, which were created by financiers coupling different financial instruments together. What had been loosely coupled became tightly coupled. He also explained how risks get socialized. In the case of the financial crisis, there seems to have been a mismatch between the gains bankers captured leading up to the crisis and the losses those same bankers suffered afterward (shareholders may be a different story).
- It is hard to read this book and be optimistic about organizations acting in the best interests of society. There are many examples of companies, regulators, individuals, and media organizations failing to examine the origin and path of an accident. They fail to see how operators were helpless in the face of incorrect or unclear information. They ignore how experts disagreed about what was happening and which information the operators should have been reacting to. They ignore the many times experts declared certain outcomes completely impossible (so no contingency needed to be prepared), only for those outcomes to occur soon after operations began.
- The author observes that some systems are error-reducing and cites air travel as one such area. Pilots are interested in air safety and well organized. Politicians, who fly, are equally interested in air safety. Airlines understand that if the public is not convinced that air travel is safer than other modes (car, train, ship), they have no business. Everything works together to decrease risk. Other systems are error-inducing, with marine transport as the example. Overly centralized decision making, volatile weather conditions, production pressures, poor training, and transient workers make for complex, coupled conditions. The book spends a chapter describing how ships that are not on collision courses unwittingly maneuver to create collisions.
- The author never really talks about confidence as a problem, but perhaps one of the polarities illuminated by this book is our approach to confidence. There are some fine explanations of the link between confidence and correctness that you can find by googling "fox and hedgehog", but the upshot is this: an expert who expresses great certainty about the safety something provides is more likely to lead to trouble than one who is uncertain. Pick the expert who sees potential for both good and bad outcomes. We should learn to distrust certainty when thinking about what can go wrong.
- At the time of the Challenger explosion I remember seeing an analysis of the probability of an explosion that had been prepared by NASA scientists. They had compiled a list of things that could go wrong and then estimated the probability of each going wrong. I do not remember the details, but the computed probability of a launch explosion was something like one in a million. The computation was faulty because it treated the different possibilities as if they were independent. They were not. When asked what the odds of a failure were given a previous failure, the odds changed enormously. The statistics of "conditional risk" are not usually part of statistics training, but they should be part of a risk assessor's. The upshot of the revised (post-facto) calculation was something like a one-in-twenty-five chance of failure, and Challenger was the 25th space shuttle launch (a rough numerical sketch of the independence point follows after this list). This is the thing about tightly coupled systems: the odds are no longer independent of each other, and accidents are more probable than you think. You know this intuitively in everyday life. When the weather is near freezing, driving is less safe, so you must change your driving behavior to avoid accidents.
- There have been a number of books written since this one on the subject of misunderstanding probability. I think the book that comes closest is "The Black Swan" by Nassim Nicholas Taleb. His main point is that many events follow the normal distributions found in statistical populations, but those events don't really matter; the most important events are outliers, where things that could not reasonably be connected turn out to be connected and to matter.
- Organizational decision making was briefly a topic in this book and focused on what was called the "Garbage Can" model. More about this can be found at https://study.com/academy/lesson/the-garbage-can-model-of-decision-making.html or https://en.wikipedia.org/wiki/Garbage_can_model . What interests me about this view (besides its realism) is the recognition that decisions may not be related to the problems that create the need for a decision. This seems especially to be the case when the subject is strategy, but Perrow's observation is that the same anarchic, non-logical effects blind people to the impact of risks being taken.
- I have a personal interest in the area of agricultural sustainability. Since my graduate student days, I've understood that some common farming practices damage the environment. Some damage is localized to the farmer's land, some is regional, and, I now realize, some is even global. Since farming is as close to a required activity in our society as any, thinking clearly about these risks is useful. Farming is one of the most dangerous occupations in the United States (and probably globally), with a large number of injuries and deaths among people under 16. So farming has plenty of first- and second-party victims. Farm runoff causes water problems in coastal areas, decreasing seafood production and creating third-party victims. Finally, farm practices generate greenhouse gases that will persist in the atmosphere for centuries, creating fourth-party victims. Yet these activities are required and somewhat inevitable consequences of growing the crops and raising the animals we need for our food supply. I was trained to see all of the connections between different sub-systems, units, and parts; it is really complex. Some interconnections are loose and some are tight. Agriculture exposes another system problem: the biggest problems created by modern agriculture are created in a widely distributed way by independent decision makers, while the consequences are mostly felt by third- and fourth-party victims in ways that are hard to connect to the original actions. The economic imperatives that farmers feel compel them to act in ways that seem rational at the present time. The book comments that it is pointless to talk about capitalism or free markets as potential drivers of system accidents. There is some truth to this, but it is worth noting how much economic theory frames our choices. I don't have space (or competence) to expand on this point, but suffice it to say that the ability to turn damage into an externality and not pay for that damage at the point of creation distorts our choices. To the extent that we distort our economy to avoid paying for the damage we create, we can't begin to make choices that will decrease the damage to future production. In this sense, climate change may be another example of a normal accident.
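Here is the independence point from the Challenger bullet above in miniature. The numbers are invented for illustration; this is not the NASA calculation, just a sketch of how correlated failures change the arithmetic.

```python
# A minimal sketch of why independence assumptions understate accident risk.
# All numbers are invented for illustration; this is not the NASA analysis.
# Suppose an accident requires three safeguards to fail on the same launch.

p_first = 0.01                 # assumed chance any one safeguard fails
p_next_if_independent = 0.01   # same chance again if failures are independent
p_next_given_previous = 0.30   # assumed: a common cause (e.g. cold weather)
                               # makes further failures far more likely

p_accident_independent = p_first * p_next_if_independent ** 2
p_accident_coupled = p_first * p_next_given_previous ** 2

print(f"Assuming independence: {p_accident_independent:.2e}")  # 1.00e-06
print(f"Assuming coupling:     {p_accident_coupled:.2e}")      # 9.00e-04
print(f"Ratio: {p_accident_coupled / p_accident_independent:.0f}x more likely")
```

The same structure is how a one-in-a-million style estimate can coexist with a post-facto estimate closer to one in twenty-five: once failures are allowed to be conditional on one another, the computed odds shift by orders of magnitude.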