The Swiss cheese model: Designing to reduce catastrophic losses

Failures and errors happen frequently. A part breaks, an instruction is misunderstood, a rodent chews through a power cord. The issue gets noticed, we respond to correct it, we clean up any impacts, and we’re back in business.

Occasionally, a catastrophic loss occurs. A plane crashes, a patient dies during an operation, an attacker installs ransomware on the network. We often look for a single cause or freak occurrence to explain the incident. Rarely, if ever, are these accurate.

The vast majority of catastrophes result from a series of factors that line up in just the wrong way, allowing seemingly small details to add up to a major incident.

The Swiss cheese model is a great way to visualize this and is fully compatible with systems thinking. Understanding it will help you design systems which are more resilient to failures, errors, and even security threats.

Holy cheese

A version of the Swiss cheese model: an arrow passing through aligned holes in four slices illustrates a failure progressing through every layer of controls. An image search will turn up a number of alternatives.
© Davidmack, used under a CC BY-SA 3.0 license.

The Swiss cheese model was created by Dr. James Reason, a highly regarded expert in the field of aviation safety and human error. In this model, hazards sit on one side, losses on the other, and in between are slices of Swiss cheese.

Each slice is a line of defense, something that can catch or prevent a hazard from becoming a catastrophic loss. This could be anything: backup components, monitoring devices, damage control systems, personnel training, organizational policies, etc.

Of course, Swiss cheese is famous for its holes. In the model, each hole is a gap in that layer that allows the hazard condition to progress. A hole could be anything: a broken monitoring device or backup system, an outdated regulation or policy, a misunderstanding between a pilot and air traffic control, a receptionist vulnerable to social engineering, a culture of ‘not my job’.

If you stack a bunch of random slices of Swiss cheese, the holes don’t usually line up all the way through. A failure in one aspect of the system isn’t catastrophic because other aspects will catch it. This is a “defense in depth” strategy: many layers mean many opportunities to stop a small issue from becoming a major one.

As shown in the diagram, sometimes the holes do line up. This is the trajectory of an accident, allowing an issue to propagate all the way through each layer until the catastrophic loss.
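
To see why stacking layers works so well, here is a minimal sketch in Python with made-up numbers, assuming (simplistically) that the layers fail independently. A catastrophic loss requires the hazard to find a hole in every layer, so the per-layer probabilities multiply:

    # Hypothetical, illustrative numbers: the chance that each defensive layer
    # fails to stop a given hazard (i.e. the hazard finds a "hole").
    layer_hole_probabilities = [0.10, 0.05, 0.20, 0.05]

    # With independent layers, the hazard must pass through every slice,
    # so the probability of a loss is the product of the hole probabilities.
    p_loss = 1.0
    for p_hole in layer_hole_probabilities:
        p_loss *= p_hole
    print(f"P(loss) with all layers: {p_loss:.6f}")  # 0.000050

    # Drop one layer (or let its holes grow to 100%) and the risk jumps 20x.
    p_loss_short_one = 1.0
    for p_hole in layer_hole_probabilities[:-1]:
        p_loss_short_one *= p_hole
    print(f"P(loss) missing a layer: {p_loss_short_one:.6f}")  # 0.001000

In practice the layers are rarely independent; a common cause such as a power outage or a cultural problem can open holes in several layers at once, which is exactly why the holes sometimes line up.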

UPS 747 Cargo Plane
© Frank Kovalchek, used under a CC BY 2.0 license

A great example is UPS Flight 6, which crashed in Dubai in 2010. A fire broke out on board, started by lithium-ion batteries that were being shipped improperly. Other planes, including UPS planes, have experienced similar fires but were able to land safely. This fire had to pass through many layers of defense before the crash happened:

  1. hazardous cargo policy — failed to properly identify the batteries and control where they were loaded in the plane
  2. smoke detection system — inhibited because rain covers on the shipping pallets contained the smoke until the fire was very large
  3. fire suppression system — not intended for the type of fire caused by batteries, thus less effective
  4. flight control systems — unable to withstand the heat, making the plane increasingly difficult to control
  5. air conditioning unit — failed, apparently for reasons unrelated to the fire, allowing the cockpit to fill with smoke
  6. cockpit checklists and crew training — didn’t have sufficient guidance for this type of situation, leading the crew to make several mistakes which exacerbated the situation
  7. pilot’s oxygen mask — damaged by the heat; he became incapacitated and likely died while still in the air
  8. copilot’s oxygen mask — set to a mixed-atmosphere mode instead of 100% oxygen, allowing some smoke into the mask and reducing his effectiveness
  9. emergency radio frequency — the copilot tried to use this international-standard frequency, but air traffic control wasn’t monitoring it; without directions from the controllers he couldn’t find the airport

Ultimately, the flight control systems failed completely and the copilot could no longer control the aircraft. As you can see, this incident could have ended very differently if any single one of those nine layers had not allowed the accident to progress1.

Applying the model in design

This model is most often used to describe accidents after the fact. But it’s just as applicable to the resiliency of a system during the design phase, where applying it can have the greatest impact on safety and security.

Add layers deliberately and with care

The layers of cheese in the model suggest an easy solution to any hole: add more layers. Another inspection step, component redundancy, a review gate, etc.

This is a defense-in-depth strategy and it’s essential for the first few layers. However, it quickly becomes onerous and costly. It can also backfire by providing a false sense of security (‘I don’t have to catch 100% of problems because the next step will’).

Before adding a slice, carefully analyze the system to determine if there may be a better way to address the concern.

Fill the holes

Most often, the best solution is to minimize the holes in each layer by making that layer more robust, or to replace a layer with one that better addresses all of the risks.

An easy example is the copilot’s oxygen mask setting from the UPS flight above. The copilot had chosen a setting which varied the amount of oxygen based on altitude. On one hand, that’s reasonable: at lower altitudes the need for supplemental oxygen is lower. On the other hand, there’s no risk in providing too much oxygen, so why not provide a simpler system which only delivers 100% oxygen and leaves less room for error?

The best designs add minimal friction while providing value to the users: for example, a maintenance system which pre-fills documentation when the user scans a part. Users prefer it because it simplifies their job, the documentation is more complete and accurate, and the system can automatically double-check that the part is compatible.
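
As a rough sketch of that idea (the part numbers, catalog, and field names here are hypothetical; a real system would hook into the actual maintenance database), the scan both saves the technician typing and acts as another slice of cheese by rejecting incompatible parts:

    from dataclasses import dataclass
    from datetime import date

    # Hypothetical parts catalog keyed by the barcode scanned off each part.
    PART_CATALOG = {
        "PN-1234": {"description": "Hydraulic pump",
                    "compatible_models": {"737-800", "747-400F"}},
        "PN-5678": {"description": "Fuel filter",
                    "compatible_models": {"A320"}},
    }

    @dataclass
    class MaintenanceRecord:
        part_number: str
        description: str
        aircraft_model: str
        installed_on: date

    def prefill_record(scanned_part: str, aircraft_model: str) -> MaintenanceRecord:
        """Pre-fill documentation from a part scan and reject incompatible parts."""
        part = PART_CATALOG.get(scanned_part)
        if part is None:
            raise ValueError(f"Unknown part {scanned_part!r}; scan again or contact engineering")
        if aircraft_model not in part["compatible_models"]:
            raise ValueError(f"{scanned_part} is not approved for the {aircraft_model}")
        return MaintenanceRecord(scanned_part, part["description"], aircraft_model, date.today())

    print(prefill_record("PN-1234", "747-400F"))

The technician only confirms the pre-filled fields, and the compatibility check runs on every scan instead of relying on memory.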

This is always easier to accomplish during initial design than to bolt on after the fact. Software engineering has made a big step forward in this regard with DevSecOps, baking security in from the start rather than trying (usually with little success) to force it onto the system later. Resiliency should be incorporated into every step of an engineering project.

Analyze and accept risk

Finally, there’s never going to be a 100% safe and secure system. We must quantify the risks as best as possible, control them as much as practicable, and eventually accept the residual risk.

How have you applied the Swiss cheese model in your work? Do you have any criticisms of it or perhaps an alternative perspective? Share your thoughts in the comments.

It’s time to get rid of specialty engineering: A criticism of the INCOSE Handbook

Chapter 10 of the INCOSE Systems Engineering Handbook covers “Specialty Engineering”. Take a look at the table of contents below. It’s a hodge-podge of roles and skillsets with varying scope.

Table of contents for the Specialty Engineering section of the INCOSE handbook.

There doesn’t seem to be rhyme or reason to this list of items. Training Needs Analysis is a perfect example. There’s no doubt that it’s important, but it’s one rather specific task and not a field unto itself. If you’re going to include this activity, why not its siblings Manpower Analysis and Personnel Analysis?

On the other hand, some of the items in this chapter are supposedly “integral” to the engineering process. This is belied by the fact that they’re shunted into this separate chapter at the end of the handbook. In practice, too, they’re often organized into a separate specialty engineering group within a project.

This isn’t very effective.

Many of these roles really are integral to systems engineering. Their involvement early on in each relevant process ensures proper planning, awareness, and execution. They can’t make this impact if they’re overlooked, which often happens when they’re organizationally separated from the rest of the systems engineering team. By including them in the specialty engineering section along with genuinely tangential tasks, INCOSE has basically stated that these roles are less important to the success of the project.

The solution

The solution is simple: re-evaluate and remove, or at least re-organize, this section of the handbook.

The actual systems engineering roles should be integrated into the rest of the handbook; most of them are already mentioned throughout the document. The description of each role currently in the specialty engineering section can be moved to the appropriate process section. Human systems integration, for example, might fit into “Technical Management Processes” or “Cross-Cutting Systems Engineering Methods”.

The tangential tasks, such as Training Needs Analysis, should be removed from the handbook altogether. These would be more appropriate as a list of tools and techniques maintained separately online, where it can be updated frequently and cross-referenced with other sources.

Of course, the real impact comes when leaders internalize these changes and organize their programs to effectively integrate these functions. That will come with time and demonstrated success.

The Boeing 737 Max crashes represent a failure of systems engineering

The 737 is an excellent airplane with a long history of safe, efficient service. Boeing’s cockpit philosophy of direct pilot control and positive mechanical feedback represents excellent human factors1. In the latest generation, the 737 Max, Boeing added a new component to the flight control system which deviated from this philosophy, resulting in two fatal crashes. This is a case study in the failure of human factors engineering and systems engineering.

The 737 Max and MCAS

You’ve certainly heard of the 737 Max, the fatal crashes in October 2018 and March 2019, and the Maneuvering Characteristics Augmentation System (MCAS) which has been cited as the culprit. Even if you’re already familiar, I highly recommend these two thorough and fascinating articles:

  • Darryl Campbell at The Verge traces the market pressures and regulatory environment which led to the design of the Max, describes the cockpit activities leading up to each crash, and analyzes the information Boeing provided to pilots.
  • Gregory Travis at IEEE Spectrum provides a thorough analysis of the technical design failures from the perspective of a software engineer along with an appropriately glib analysis of the business and regulatory environment.

Typically I’d caution against armchair analysis of an aviation incident until the final crash investigation report is in. However, given the availability of information on the design of the 737 Max, I think the engineering failures are clear even as the crash investigations continue.

Hazard analysis

The most glaring, obvious, and completely inexplicable design choice was a lack of redundancy in the MCAS sensor inputs. Gregory Travis blames “inexperience, hubris, or lack of cultural understanding” on the part of the software team. That certainly seems to be the case, but it’s nowhere near the whole story.

There’s a team whose job it is to understand how the various aspects of the system work together: systems engineering2. One essential job of the systems engineer is to understand all of the possible interactions among system components, how they interact under various conditions, and what happens if any part (or combination of parts) fails. That last part is addressed by hazard analysis techniques such as failure modes, effects, and criticality analysis (FMECA).

The details of risk management may vary among organizations, but the general principles are the same: (1) Identify hazards, (2) categorize by severity and probability, (3) mitigate/control risk as much as practical and to an acceptable level, (4) monitor for any issues. These techniques give the engineering team confidence that the system will be reasonably safe.

FAA Safety Risk Management Process and Risk Categorization Matrix, from FAA Order 8040.4B, Safety Risk Management Policy.
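
As a minimal illustration of step 2, categorization, here is a toy risk matrix in Python. The severity and likelihood scales are simplified placeholders, not the actual definitions from FAA Order 8040.4B:

    # Simplified scales for illustration only (not the FAA's definitions).
    SEVERITY = ["minimal", "minor", "major", "hazardous", "catastrophic"]
    LIKELIHOOD = ["extremely improbable", "extremely remote", "remote",
                  "probable", "frequent"]

    def risk_level(severity: str, likelihood: str) -> str:
        """Categorize a hazard as low/medium/high from its position in the matrix."""
        sev = SEVERITY.index(severity)
        lik = LIKELIHOOD.index(likelihood)
        if sev == len(SEVERITY) - 1 and lik >= 2:
            return "high"  # plausibly catastrophic outcomes demand mitigation
        score = sev + lik  # crude stand-in: risk grows along both axes
        if score >= 6:
            return "high"
        if score >= 3:
            return "medium"
        return "low"

    # A hazard like 'uncommanded nose-down trim from a single failed sensor'
    # should land in the high-risk cells and force mitigation before acceptance.
    print(risk_level("catastrophic", "remote"))      # high
    print(risk_level("minor", "extremely remote"))   # low

The point is not these particular thresholds but that every identified hazard gets an explicit, reviewable position in the matrix before it is mitigated or accepted.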

On its own, the angle of attack (AoA) sensor is an important but not critical component. The pilots can fly the plane without it, though stall-protection, automatic trim, and autopilot functions won’t work normally, increasing pilot workload. The interaction between the sensor and flight control augmentation system, MCAS in the case of the Max, can be critical. If MCAS uses incorrect AoA information from a faulty sensor, it can push the nose down and cause the plane to lose altitude. If this happens, the pilots must be able to diagnose the situation and respond appropriately. Thus the probability of a crash caused by an AoA failure can be notionally figured as follows:

P(AoA sensor failure) × P(system unable to recognize failure) × P(system unable to adapt to failure) × P(pilots unable to diagnose failure) × P(pilots unable to disable MCAS) × P(pilots unable to safely fly without MCAS)
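
Plugging purely hypothetical numbers into that chain shows how it behaves. Every sound layer shrinks the product, while a layer with probability 1.0 contributes nothing:

    # Purely hypothetical probabilities for each link in the notional chain.
    chain = {
        "AoA sensor failure": 0.001,
        "system unable to recognize failure": 1.0,  # no cross-check of the sensors
        "system unable to adapt to failure": 1.0,
        "pilots unable to diagnose failure": 0.5,
        "pilots unable to disable MCAS": 0.5,
        "pilots unable to safely fly without MCAS": 0.1,
    }

    p_crash = 1.0
    for probability in chain.values():
        p_crash *= probability
    print(f"Notional P(crash): {p_crash:.2e}")  # 2.50e-05

    # If the system recognized a sensor failure even 99% of the time,
    # the same chain would shrink by two orders of magnitude.
    chain["system unable to recognize failure"] = 0.01
    p_crash = 1.0
    for probability in chain.values():
        p_crash *= probability
    print(f"With a cross-check: {p_crash:.2e}")  # 2.50e-07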

AoA sensors can fail, but that shouldn’t be much of an issue because the plane has at least two of them and it’s pretty easy for the computers to notice a mismatch between them and also with other sources of attitude data such as inertial navigation systems. Except, of course, that the MCAS didn’t bother to cross-check; the probability of the Max failing to recognize and adapt to a potential AoA sensor failure was 100%. You can see where I’m going with this: the AoA sensor is a single point of failure with a direct path through the MCAS to the flight controls. Single point of failure and flight controls in the same sentence ought to give any engineer chills.
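
The cross-check itself is not complicated. Here is a simplified sketch in Python; the thresholds, signal names, and the comparison against an independent estimate are all invented for illustration, and real flight software would do far more:

    AOA_DISAGREE_THRESHOLD_DEG = 5.0  # invented threshold, for illustration only

    def aoa_data_valid(aoa_left_deg: float, aoa_right_deg: float,
                       independent_estimate_deg: float) -> bool:
        """True only if the redundant AoA vanes agree with each other and are
        roughly consistent with an independent, non-AoA estimate."""
        if abs(aoa_left_deg - aoa_right_deg) > AOA_DISAGREE_THRESHOLD_DEG:
            return False  # the two vanes disagree; one of them is suspect
        consensus = (aoa_left_deg + aoa_right_deg) / 2
        if abs(consensus - independent_estimate_deg) > 2 * AOA_DISAGREE_THRESHOLD_DEG:
            return False  # both vanes disagree with an independent source
        return True

    def mcas_may_activate(aoa_left_deg: float, aoa_right_deg: float,
                          independent_estimate_deg: float) -> bool:
        # Fail safe: with questionable AoA data, leave the trim alone and let
        # the pilots fly the airplane rather than command the nose down.
        return aoa_data_valid(aoa_left_deg, aoa_right_deg, independent_estimate_deg)

    print(mcas_may_activate(25.0, 5.0, 4.0))  # False: the vanes disagree badly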

The next link in our failure chain is the pilots and their ability to recognize, diagnose, and respond to the issue. This implies proper training, procedures, and understanding of the system. From the news coverage, it seems that pilots were not provided sufficient information on the existence of MCAS and how to respond to its failure. Systems and human factors engineers, armed with a hazard analysis, should have known about and addressed this potential contributing factor to reduce the overall risk.

Finally, there’s the ability of the pilots to disable MCAS and fly without it. The Ethiopian Airlines crew correctly diagnosed and responded to the issue, but the aerodynamic forces apparently prevented them from manually correcting it. The ability to override those forces, plus the time it takes to correct the flight path, should have been part of the FMECA.

I have no specific knowledge of the hazard analyses performed on the 737 Max. Based on recent events, it seems that the risk of this type of failure was severely underestimated or went unaddressed. Either one is equally poor systems engineering.

Cockpit human factors

An inaccurate hazard analysis, though inexcusable, could be an oversight. Compounding that, Boeing made a clear design decision in the cockpit controls which is hard to defend.

In previous 737 models, pilots could quickly override automatic trim control by yanking back on the yoke, similar to disabling cruise control in a car by hitting the brake. This is great human factors and it fit right in with Boeing’s cockpit philosophy of ensuring that the human was always in ultimate control. This function was removed in the Max.

As both the Lion Air and Ethiopian Airlines crew experienced, the aerodynamic forces being fed into the yoke are too strong for the human pilots to overcome. When MCAS directs the nose to go down, the nose goes down. Rather than simply control the airplane, Max pilots first have to disable the automated systems. Comparisons to HAL are not unwarranted.

In summary

Boeing is developing a fix for MCAS. It will include redundant AoA sensor inputs, inhibiting MCAS when the sensors disagree, limiting MCAS to a single activation per high-AoA indication (i.e. not continuously re-activating after the pilots have given contrary commands), and limiting the forces fed back into the control yoke so that they aren’t stronger than the pilots. This functionality should have been part of the system to begin with.
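
Reading between the lines of the public description (this is my interpretation, not Boeing’s implementation), the new activation logic amounts to a handful of guard conditions:

    def mcas_should_command_trim(aoa_sensors_agree: bool,
                                 high_aoa_indicated: bool,
                                 already_activated_this_event: bool,
                                 pilot_commanding_opposite: bool) -> bool:
        """Illustrative guards based on the publicly described MCAS fix."""
        if not aoa_sensors_agree:
            return False  # disagreeing sensors: don't trust the AoA data
        if not high_aoa_indicated:
            return False  # nothing to augment
        if already_activated_this_event:
            return False  # only one activation per high-AoA event
        if pilot_commanding_opposite:
            return False  # defer to the pilots' contrary input
        return True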

Along with these fixes, Boeing is likely3 also re-conducting a complete hazard analysis of MCAS and other flight control systems. Boeing and the FAA should not clear the type until the hazards are completely understood, controlled, quantified, and deemed acceptable.

Many news stories frame the 737 Max crashes in terms of the market and regulatory pressures which resulted in the design. While I don’t disagree, these are not an excuse for the systems engineering failures. The 737 Max is a valuable case study for engineers of all types in any industry, and for systems engineers in high-risk industries in particular.

System lexicons and why your project needs one

A system lexicon is a simple tool which can have a big impact on the success of the system. It aligns terminology among technical teams, the customer, subcontractors, support personnel, and end users. This creates shared understanding and improves consistency. Read on to learn how to implement this powerful tool on your program.


An Engineering Touchstone to Enable Successful Designs

Successful systems are created by engineers who understand and design to the ultimate objectives of the project. When we lose sight of those objectives we start making design decisions based on the wrong criteria and thus create sub-optimal designs. Scope creep, group think, and simple convenience are frequent causes of this type of variation. An effective design assessment tool is a touchstone by which we can evaluate the effectiveness of ongoing design decisions and keep the focus on the optimal solution.
