Human Systems Integration Archives » Engineering for Humans

Written in Blood: Case Studies of Systems Engineering Failure

Posted by Benjamin Schwartz on 8 December 2024

Our world and our systems are safer than ever. A major reason is that we’ve learned from prior mistakes. Many of our practices, rules, and standards are “written in blood” from past, tragic failures.

We learn so that we don’t repeat the same mistakes. Of course, we first identify the proximate causes—the specific events directly leading to a casualty. To truly learn, we must take a step back to examine the larger context: what were the preceding holes in the Swiss cheese, and how do we account for them in our systems engineering practice? This approach is increasingly important as systems grow in complexity. This post describes three case studies illustrating a common theme: increasing complexity demands strong systems engineering approaches to maximize safety, system performance, and suitability.

USS McCain Collision

In 2017, USS McCain collided with the merchant vessel Alnic MC in the Singapore Strait. McCain was heading to port to fix issues with its integrated bridge and navigation system (IBNS), which was prone to bugs and failures. The ship’s captain, distrustful of the buggy IBNS, operated it in a manual backup mode that bypassed several safeguards.

The incident began in the early morning in a busy shipping lane, a high-demand situation. To reduce workload on junior helmsmen, the captain ordered a “split helm” with one station controlling steering and one controlling speed. In the backup manual mode, changing to split helm required a multi-step, manual process. The crew made several mistakes, starting with transferring steering to the wrong station without realizing it.

Believing steering was lost, the helmsman instinctively reduced thrust to slow the ship. However, the incomplete manual transfer process had left the port and starboard throttle controls unganged, and only the port shaft slowed. Instead of decelerating, the McCain veered directly into the path of the Alnic MC.
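To make the ganging problem concrete, here is a minimal toy sketch in Python. It is not the real IBNS logic; the `Throttles` class, its fields, and the thrust fractions are all hypothetical, chosen only to show how the same helmsman input produces a straight-line slowdown when the throttles are ganged and a thrust imbalance when they are not.

```python
from dataclasses import dataclass


@dataclass
class Throttles:
    """Toy two-shaft throttle model (hypothetical, not the real IBNS)."""
    port: float = 1.0        # commanded thrust fraction, 0.0 to 1.0
    starboard: float = 1.0   # commanded thrust fraction, 0.0 to 1.0
    ganged: bool = True      # when True, one input moves both shafts

    def reduce(self, target: float) -> None:
        """The helmsman pulls back one throttle (modeled here as port)."""
        self.port = target
        if self.ganged:
            # Ganged controls keep both shafts at the same setting.
            self.starboard = target

    def thrust_imbalance(self) -> float:
        """Starboard minus port thrust; positive pushes the bow to port."""
        return self.starboard - self.port


ganged = Throttles(ganged=True)
ganged.reduce(0.5)
print(ganged.thrust_imbalance())    # 0.0: the ship slows on a steady heading

unganged = Throttles(ganged=False)
unganged.reduce(0.5)
print(unganged.thrust_imbalance())  # 0.5: only port slowed, so the ship turns
```

The single boolean `ganged` changes the outcome of an identical input, which is why the bridge team needs that state to be visible at a glance.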

Helm station of a US Navy ship — *IBNS on USS* Dewey; US Navy photo

In the image above, from a sister ship of McCain, you will notice a large red button meant to take emergency control of the ship. This was certainly a situation in which to use it. However, the crew had an incorrect understanding of how this button worked and what would happen when it was pressed. Over the course of 16 seconds, multiple crewmembers pressed the big red buttons at their stations and control switched between the bridge and aft backup stations three times. By the time the crew regained control, it was too late. A few seconds later, the bow of Alnic MC struck McCain, and ten sailors died.

USS McCain with damage from the collision — *Damage to* McCain; US Navy photo

The crew of McCain had indeed lost control of the ship. However, it was not because of any technical failure. It was caused by a confluence of design, training, and operational deficiencies. Among these, a lack of trust in automation led the crew to operate in backup manual mode without the support and safeguards that would have improved safety. Most pressing, in my opinion, were the confusing controls that didn’t retain the functionality of traditional physical controls. Critically, the IBNS on McCain removed the physical throttles in favor of only digital displays; physical throttle controls would have made it obvious to the entire bridge team at a glance that the controls were not ganged.

Chekov’s Gear Shifter

Composite of the Jeep gear shifter and actor Anton Yelchin in Star Trek costume

The same type of design shortcoming affects our daily lives as well. Anton Yelchin, known for playing Chekov in the 2009 Star Trek film, died tragically in 2016 at age 27. After parking his Jeep Grand Cherokee in his driveway, he exited the vehicle, unaware it was still in neutral. The car rolled backward, crushing him against a wall and killing him.

The issue was the design of the “monostable” shift knob, pictured above. At the time of this incident, the design had already been implicated in 266 crashes causing 68 injuries. Complaints were that the shifter didn’t provide adequate tactile or position feedback to the driver, causing incorrect gears to be selected and particularly putting the vehicle into neutral or reverse instead of park. Fiat Chrysler had already initiated a voluntary recall based on these complaints and was working on a software update to provide better feedback and to prevent the car from moving in certain conditions. Unfortunately, the fix wasn’t available at the time of Yelchin’s death, and the incident caused Fiat Chrysler to fast-track the development and fielding of the fix.

This case highlights the risks of abandoning proven design principles. The monostable shifter resembled traditional designs but worked very differently, confusing users without adding functional benefits.

Breaking established design patterns is not necessarily a bad thing. Historically, function drove form with, for example, mechanical linkages between the shifter and transmission. Users are able to develop mental models of the system by following these physical paths. Moving to software-defined systems enables much more flexibility in design, which can enhance performance and safety. But it also removes many of the constraints of physical systems, breaks mental models, and allows solutions to become much more complex. That can result in undesired emergent behavior.

Two related design principles are “if it’s not broken, don’t fix it” and to build on a user’s existing mental model. When it is necessary to depart from existing concepts to achieve solution objectives, the designer must carefully consider the potential impacts and account for them. The monostable shifter was problematic because it looked like a traditional shifter but worked differently. Not only did that trip up users, it didn’t actually add anything to the effectiveness of the solution; it was just for the sake of being different.

Aircraft Flight Controls

Composite of older and newer aircraft cockpits — *Composite of a Hawker Siddeley Trident (left) and an Airbus A380 (right) cockpit (Originals by Nimbus227 and Naddsy CC BY 2.0)*

A positive example comes from aviation. Across manufacturers, models, and decades of technology advancement, controls and displays have remained relatively consistent with proven success. Basic primary flight display layout and colors are similar across aircraft, whether glass displays or traditional gauges; essential controls have the same shape coding and movement in any aircraft.

An effective, optimized, proven layout allows the pilot to focus on managing the aircraft and executing their mission. It also enables skills transfer; a pilot learning a new aircraft can focus on the unique qualities of that type rather than re-learning basic flight controls and displays.

Automation in modern aircraft has increased substantially. Airbus pioneered digital fly-by-wire in commercial airliners: there are no direct linkages between cockpit controls and aircraft control surfaces, and all inputs are mediated by software. That radically enhances safety. It’s nearly impossible to stall an Airbus because the flight computer won’t let you leave the safe flight envelope. Even so, it’s not infallible.

Air France Flight 447

In 2009, Air France flight 447 crashed into the Atlantic on a flight from Rio de Janeiro to Paris. The aircraft entered icing conditions and the pitot tubes froze over, causing airspeed data to be unreliable. Without valid airspeed data, the autopilot disconnected, the autothrust became unavailable, and the flight control software switched to an alternate control law. In this alternate law, the software didn’t mediate the pilot’s control inputs and so the controls were much more sensitive.
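The cascade described above can be sketched as a small state transition. This is a simplified illustration in Python, not Airbus flight control logic; `CockpitState`, `ControlLaw`, and `on_airspeed_update` are hypothetical names, and the real reconfiguration rules are far more involved. The point is only to show how one sensor fault changes several things the crew relies on at the same instant.

```python
from dataclasses import dataclass, fields
from enum import Enum


class ControlLaw(Enum):
    NORMAL = "normal"        # full flight envelope protection
    ALTERNATE = "alternate"  # reduced protection, as described in the text


@dataclass(frozen=True)
class CockpitState:
    """Hypothetical snapshot of the automation the crew depends on."""
    autopilot_engaged: bool = True
    autothrust_available: bool = True
    control_law: ControlLaw = ControlLaw.NORMAL


def on_airspeed_update(state: CockpitState, airspeed_valid: bool):
    """Return the next state and the names of everything that just changed."""
    if airspeed_valid:
        return state, []
    # Assumption for this sketch: invalid airspeed degrades all three at once.
    degraded = CockpitState(
        autopilot_engaged=False,
        autothrust_available=False,
        control_law=ControlLaw.ALTERNATE,
    )
    changed = [
        f.name for f in fields(state)
        if getattr(state, f.name) != getattr(degraded, f.name)
    ]
    return degraded, changed


next_state, changed = on_airspeed_update(CockpitState(), airspeed_valid=False)
print(changed)  # ['autopilot_engaged', 'autothrust_available', 'control_law']
```

Returning the list of changes alongside the new state makes the human cost explicit: each entry is something the pilot has to notice, interpret, and compensate for under stress.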

The most junior of the pilots onboard was the pilot flying and he struggled to adapt to the abrupt change. He spent the first 30 seconds getting a back-and-forth roll under control, over-correcting as he got used to the more sensitive handling of the aircraft. As he was fighting this roll, he also pulled back on the stick, which is a natural tendency of pilots in tense situations; that caused the aircraft to climb very steeply and ultimately stall. He continued to pull back almost the entire rest of the flight, even as the more experienced pilot tried to take control and push the nose forward to regain airspeed; the mismatched inputs caused “DUAL INPUT” warnings that the crew either didn’t notice or ignored, and the plane continued to respond to the junior pilot’s incorrect inputs.

Without the software providing flight envelope protection, the normally unstallable aircraft fell out of the sky and 228 people died. The last thing the pilot flying said was “We’re going to crash! This can’t be true. But what’s happening?” Just like with the McCain, there was nothing inherently wrong with the aircraft, just a disconnect between the user’s understanding and the actual state of the system, fueled by a rapid change of state, a stressful situation, and inability to rebuild situational awareness in time.

*Recovery of wreckage from Flight 447 (Roberto Maltchik Repórter da TV Brasil, CC BY 3.0 br)*

There are several lessons in this case study, but what stands out is the paradox of automation. The pilot was so used to the safety of the flight control software that his basic aviator skills weren’t available when he needed them. Other lessons concern flight control design, the need for graceful degradation, and better training.

Lessons Learned

The key takeaway from these case studies for systems engineering practice is that the performance of the system is the product of human performance and technology performance. It doesn’t matter how great the technology is if the human can’t use it to safely and effectively accomplish their mission. That’s especially true in unusual or off-nominal situations. The robustness of the system depends on the ability of the human and/or technology to account for, adapt to, and recover from unusual situations.

System Performance = Human Performance × Technology Performance
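As a rough worked example of this relation, the Python sketch below treats each factor as a score between 0 and 1. That scale, the function name `system_performance`, and the sample numbers are assumptions made for illustration; the post does not define how either factor would be measured.

```python
def system_performance(human: float, technology: float) -> float:
    """Multiply two scores, each assumed to lie between 0.0 and 1.0."""
    for name, value in (("human", human), ("technology", technology)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be between 0 and 1, got {value}")
    return human * technology


# Flawless technology that operators can only use half as well as intended.
print(system_performance(human=0.5, technology=1.0))    # 0.5

# More modest technology that operators can use well.
print(system_performance(human=0.75, technology=0.75))  # 0.5625
```

Because the factors multiply rather than add, a weak human score caps the result no matter how strong the technology score is, which is the argument of the paragraph above.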

With the rise of software-defined systems, complexity is outpacing our ability to characterize emergent behavior. One role of systems engineering is to minimize and manage this complexity. Human-centered approaches have proven to be effective at supporting user performance within complex systems, especially when combined with the frequent user feedback and iteration of the agile methodology. Building from user needs ensures that complexity is added when necessary for the sake of the solution rather than because it’s cool (Jeep monostable shifter) or economical (McCain IBNS). It also suggests other, non-design aspects that support user performance such as decision aids and training (AF447).

Adding human-centered design in SE is easy:
• Work from first principles, always
• Model user workflows, then ask “how might we…?”
• Evolutionary vs. revolutionary technology
• Thoughtfulness is next to godliness

Create for the user and the system will be successful

Human-centered approaches are a natural part of a holistic systems engineering program. Too often, engineers focus on developing technological aspects without a true understanding of the stakeholder and mission needs. These are the first principles that should guide all of our design decisions before we start to write code or bend metal.

From that deep understanding, we can ask my favorite question: how might we? Sometimes ‘how might we’ leads to small, incremental changes that add up to major performance gains. Other times it demands entirely rethinking the problem and solution, especially if revolutionary technology improvements are available.

Finally, designs must be thoughtful, putting practical needs ahead of novelty or cost savings to create the most effective solutions. This approach not only enhances system performance but also prevents failures. As the U.S. Navy report on the McCain collision noted:

“There is a tendency of designers to add automation based on economic benefits (e.g., reducing manning, consolidating discrete controls, using networked systems to manage obsolescence), without considering the effect to operators who are trained and proficient in operating legacy equipment.”
— US Navy report on the McCain collision

By prioritizing user needs, we can create systems that are safe, effective, and resilient in the face of challenges.

What case studies or examples have influenced your thinking? How do you apply user-centered or other approaches successfully in your practice? Share in the comments below.

System Design Lessons from the USS McCain

Posted by Benjamin Schwartz on 16 June 2024

The Navy installed touch-screen steering systems to save money.

Ten sailors paid with their lives.
ProPublica

*USS* McCain *in 2019 (U.S. Navy Photo)*

Ten sailors died after the crew of the destroyer USS John S. McCain lost control of their vessel, causing a collision with the merchant tanker Alnic MC. There was nothing technically wrong with the vessel or its controls. Though much of the blame was put on the Sailors and Officers aboard, the real fault rests with the design of the Integrated Bridge & Navigation System (IBNS).

Human Factors Design Drives System Performance

Posted by Benjamin Schwartz on 28 March 2021

Bottom Line Up Front:

Human performance is a major factor in overall system performance
Humans are increasingly the bottleneck for system performance
Human factors engineering design drives human performance and thus system performance

Why care about humans?

In many system development efforts, the focus is on the capabilities of the technology: How fast can the jet fly? How accurately can the rifle fire?

We can talk about the horsepower of the engines and the boring of the rifle until the cows come home, but without a human pressing the throttle or pulling the trigger, neither technology is doing anything. A major mistake in many systems engineering efforts is neglecting the impact of the human on the performance of the system.

DoD re-affirms importance of HSI in acquisition policy overhaul

Posted by Benjamin Schwartz on 27 September 2020

The US Department of Defense is overhauling its acquisition policy from a stale, process-driven approach to a new, outcome-driven approach. The new concept is called the Adaptive Acquisition Framework (AAF). Its goal is to remove bureaucracy and give government program managers more flexibility to adapt to the needs of their particular acquisition.

The Swiss cheese model: Designing to reduce catastrophic losses

Posted by Benjamin Schwartz on 21 July 2019

Failures and errors happen frequently. A part breaks, an instruction is misunderstood, a rodent chews through a power cord. The issue gets noticed, we respond to correct it, we clean up any impacts, and we’re back in business.

Occasionally, a catastrophic loss occurs. A plane crashes, a patient dies during an operation, an attacker installs ransomware on the network. We often look for a single cause or freak occurrence to explain the incident. Rarely, if ever, are these explanations accurate.

System lexicons and why your project needs one

Posted by Benjamin Schwartz on 21 April 2019

A system lexicon is a simple tool which can have a big impact on the success of the system. It aligns terminology among technical teams, the customer, subcontractors, support personnel, and end users. This creates shared understanding and improves consistency. Read on to learn how to implement this powerful tool on your program.

The Role of the Human Systems Integrator

Posted by Benjamin Schwartz on 1 March 2019

This article is required reading for anyone who needs to hire, wants to become, or is going to be working with an HSI expert. Understand what the job entails, the key skills required, and how it relates to the rest of the systems engineering effort.

The Value of HSI

Posted by Benjamin Schwartz on 23 February 2019

The application of human systems integration (HSI) throughout a project results in improved system performance, reduced lifecycle cost, reduced development risk, and no increase in development cost when executed effectively.

Is Human Systems Integration Different from Human Factors Engineering?

Posted by Benjamin Schwartz on 22 February 2019

DoD acquisition policy requires Human Systems Integration (HSI). Various human-centered engineering approaches are also gaining traction outside of military projects. But is HSI just a fancier way of saying Human Factors Engineering (HFE)?

Human Systems Integration: The Basics

Posted by Benjamin Schwartz on 17 February 2019

Human systems integration (HSI) is a systems engineering function with the goal of optimizing system performance and cost across the entire system lifecycle. It ensures that the human elements of the system are given at least as much consideration as any other component across the entire project. HSI is a relatively new and often misunderstood term. Here are the basics: