The Swiss cheese model: Designing to reduce catastrophic losses

Failures and errors happen frequently. A part breaks, an instruction is misunderstood, a rodent chews through a power cord. The issue gets noticed, we respond to correct it, we clean up any impacts, and we’re back in business.

Occasionally, a catastrophic loss occurs. A plane crashes, a patient dies during an operation, an attacker installs ransomware on the network. We often look for a single cause or freak occurrence to explain the incident. Such explanations are rarely, if ever, accurate.

Thoughts on “A Message to Garcia”

“A Message to Garcia” is a brief essay on the value of initiative and hard work written by Elbert Hubbard in 1898. It is often assigned in leadership courses, particularly in the military. Less often assigned but providing essential context is Col. Andrew Rowan’s first-person account of the mission, “How I Carried the Message to Garcia”.

Newspaper archives and the internet also hold a number of opinion pieces both heralding and decrying the essay. The story lends itself to many interpretations and potential lessons, and it’s important that developing leaders find the valuable ones.

Work ethic

Hubbard’s original essay is something of a rant on the perceived scarcity of work ethic and initiative in the ranks of employees. He holds Rowan up as an example of the rare person who is dedicated to achieving his task unquestioningly and no matter the cost.

Of course, this complaint is not unique to Hubbard, nor is it shared universally. Your view on this theme probably depends on whether you are a manager or a worker and on your views on the value of work. Nevertheless, Hubbard’s point is clear: Strong work ethic is valuable and will be rewarded.

No questions asked

If that were the extent of the message, it would be an interesting read but not particularly compelling. One reason the essay gained so much traction is Hubbard’s waxing about how Rowan supposedly carried out his task: with little information, significant ingenuity, and no questions asked. This message appeals to a certain type of ‘leader’ who doesn’t think highly of their subordinates.

It’s also totally bogus.

Lt. Rowan was a well-trained Army intelligence officer and he was sufficiently briefed on the mission. Relying on his intelligence background, he understood the political climate and implications. Additionally, preparations were made for allied forces to transport him to Garcia. He did not have to find his own way and blindly search Cuba to accomplish his objective.

I don’t intend to minimize Rowan’s significant effort and achievement, only to point out Hubbard’s misguided message. Hubbard would have us believe that Rowan succeeded through sheer determination, when the truth is that critical thinking and understanding were his means.

There may be a time and place for blind execution, but the majority of modern work calls for specialized skills and critical thinking. Hubbard seems to conflate any question with a stupid question, which is misguided. We should encourage intelligent questions and clarifications to ensure that people can carry out their tasks effectively. After all, if Rowan hadn’t had the resources to reach Garcia, he might still be wandering Cuba and Spain might still be an empire.

The commander who dismisses all questions breeds distrust and dissatisfaction. Worse, they send their troops out underprepared.


On the topic of work ethic, Hubbard is preaching to the choir. Those with work ethic already have it, while those without it won’t be swayed by the message. Of course, managers always desire employees who demonstrate work ethic.

“A Message to Garcia” would be more effectively viewed as a treatise on leadership. After all, Army leadership effectively identified, developed, and utilized Rowan’s potential.

Perhaps the most important lesson, understated in the essay, is choosing the right person for the job. Rowan had the right combination of determination, brains, and knowledge to get the job done. In another situation, he might have been the worst person. How did Col. Wagner know about Rowan and decide he was the right person for the job? How do we optimize personnel allocation in our own organizations?

That’s my two pesetas, now you chime in below. What lessons do you take from Hubbard’s essay? Feel free to link to an interpretation, criticism, or praise which resonates with you.

It’s time to get rid of specialty engineering: A criticism of the INCOSE Handbook

Chapter 10 of the INCOSE Systems Engineering Handbook covers “Specialty Engineering”. Take a look at the table of contents below. It’s a hodge-podge of roles and skillsets with varying scope.

Table of contents for the Specialty Engineering section of the INCOSE handbook.

There doesn’t seem to be rhyme or reason to this list of items. Training Needs Analysis is a perfect example. There’s no doubt that it’s important, but it’s one rather specific task and not a field unto itself. If you’re going to include this activity, why not its siblings Manpower Analysis and Personnel Analysis?

On the other hand, some of the items in this chapter are supposedly “integral” to the engineering process. This is belied by the fact that they’re shunted into this separate chapter at the end of the handbook. In practice, too, they’re often organized into a separate specialty engineering group within a project.

This isn’t very effective.

Many of these roles really are integral to systems engineering. Their involvement early on in each relevant process ensures proper planning, awareness, and execution. They can’t make this impact if they’re overlooked, which often happens when they’re organizationally separated from the rest of the systems engineering team. By including them in the specialty engineering section along with genuinely tangential tasks, INCOSE has basically stated that these roles are less important to the success of the project.

The solution

The solution is simple: re-evaluate and remove, or at least re-organize, this section of the handbook.

The actual systems engineering roles should be integrated into the rest of the handbook. Most of them already are mentioned throughout the document. The descriptions of each role currently in the specialty engineering section can be moved to the appropriate process section. Human systems integration, for example, might fit into “Technical Management Processes” or “Cross-Cutting Systems Engineering Methods”.

The tangential tasks, such as Training Needs Analysis, should be removed from the handbook altogether. They would be more appropriate in a list of tools and techniques maintained separately online, where the list can be updated frequently and cross-referenced with other sources.

Of course, the real impact comes when leaders internalize these changes and organize their programs to effectively integrate these functions. That will come with time and demonstrated success.

The Boeing 737 Max crashes represent a failure of systems engineering

The 737 is an excellent airplane with a long history of safe, efficient service. Boeing’s cockpit philosophy of direct pilot control and positive mechanical feedback represents excellent human factors. In the latest generation, the 737 Max, Boeing added a new component to the flight control system which deviated from this philosophy, resulting in two fatal crashes. This is a case study in the failure of human factors engineering and systems engineering.

The 737 Max and MCAS

You’ve certainly heard of the 737 Max, the fatal crashes in October 2018 and March 2019, and the Maneuvering Characteristics Augmentation System (MCAS) which has been cited as the culprit. Even if you’re already familiar, I highly recommend these two thorough and fascinating articles:

  • Darryl Campbell at The Verge traces the market pressures and regulatory environment which led to the design of the Max, describes the cockpit activities leading up to each crash, and analyzes the information Boeing provided to pilots.
  • Gregory Travis at IEEE Spectrum provides a thorough analysis of the technical design failures from the perspective of a software engineer along with an appropriately glib analysis of the business and regulatory environment.

Typically I’d caution against armchair analysis of an aviation incident until the final crash investigation report is in. However, given the availability of information on the design of the 737 Max, I think the engineering failures are clear even as the crash investigations continue.

Hazard analysis

The most glaring, obvious, and completely inexplicable design choice was a lack of redundancy in the MCAS sensor inputs. Gregory Travis blames “inexperience, hubris, or lack of cultural understanding” on the part of the software team. That certainly seems to be the case, but it’s nowhere near the whole story.

There’s a team whose job it is to understand how the various aspects of the system work together: systems engineering. One essential job of the systems engineer is to understand all of the possible interactions among system components, how they interact under various conditions, and what happens if any part (or combination of parts) fails. That last part is addressed by hazard analysis techniques such as failure modes, effects, and criticality analysis (FMECA).

The details of risk management may vary among organizations, but the general principles are the same: (1) Identify hazards, (2) categorize by severity and probability, (3) mitigate/control risk as much as practical and to an acceptable level, (4) monitor for any issues. These techniques give the engineering team confidence that the system will be reasonably safe.

FAA Safety Risk Management Process and Risk Categorization Matrix from FAA Order 8040.4B, Safety Risk Management Policy.
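
To make the categorization step a little more concrete, here is a minimal Python sketch of a severity-by-likelihood lookup. The scale labels, numeric scores, thresholds, and the example hazards are my own illustrative assumptions; this is not a reproduction of the FAA matrix, and a real program would take its scales from its own risk management policy.

```python
from dataclasses import dataclass

# Hypothetical ordinal scales, for illustration only.
SEVERITY = {"minimal": 1, "minor": 2, "major": 3, "hazardous": 4, "catastrophic": 5}
LIKELIHOOD = {
    "extremely improbable": 1,
    "extremely remote": 2,
    "remote": 3,
    "probable": 4,
    "frequent": 5,
}


@dataclass
class Hazard:
    name: str
    severity: str
    likelihood: str


def risk_score(hazard: Hazard) -> int:
    """Combine severity and likelihood into a single ordinal score."""
    return SEVERITY[hazard.severity] * LIKELIHOOD[hazard.likelihood]


def risk_level(hazard: Hazard) -> str:
    """Categorize a hazard as low, medium, or high (illustrative thresholds)."""
    # Assumption: a catastrophic outcome that is anything more than
    # extremely improbable is treated as high risk regardless of score.
    if hazard.severity == "catastrophic" and LIKELIHOOD[hazard.likelihood] >= 2:
        return "high"
    score = risk_score(hazard)
    if score >= 12:
        return "high"
    if score >= 5:
        return "medium"
    return "low"


# Made-up hazard log entries; the ratings are placeholders, not real assessments.
hazards = [
    Hazard("Faulty AoA data drives automatic nose-down trim", "catastrophic", "remote"),
    Hazard("Cabin reading light fails", "minimal", "remote"),
    Hazard("Autopilot disconnects unexpectedly", "minor", "remote"),
]

# Review the hazard log with the highest scores first.
for hazard in sorted(hazards, key=risk_score, reverse=True):
    print(f"{risk_level(hazard):6} {hazard.name}")
```

The point of the lookup is not the arithmetic but the discipline: every hazard gets a rating, and anything rated high demands mitigation before the design is accepted.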

On its own, the angle of attack (AoA) sensor is an important but not critical component. The pilots can fly the plane without it, though stall-protection, automatic trim, and autopilot functions won’t work normally, increasing pilot workload. The interaction between the sensor and flight control augmentation system, MCAS in the case of the Max, can be critical. If MCAS uses incorrect AoA information from a faulty sensor, it can push the nose down and cause the plane to lose altitude. If this happens, the pilots must be able to diagnose the situation and respond appropriately. Thus the probability of a crash caused by an AoA failure can be notionally figured as follows:

P(AoA sensor failure) × P(system unable to recognize failure) × P(system unable to adapt to failure) × P(pilots unable to diagnose failure) × P(pilots unable to disable MCAS) × P(pilots unable to safely fly without MCAS)

AoA sensors can fail, but that shouldn’t be much of an issue: the plane has at least two of them, and it’s pretty easy for the computers to notice a mismatch between them or with other sources of attitude data such as inertial navigation systems. Except, of course, that MCAS didn’t bother to cross-check; the probability of the Max failing to recognize and adapt to a potential AoA sensor failure was 100%. You can see where I’m going with this: the AoA sensor is a single point of failure with a direct path through MCAS to the flight controls. Single point of failure and flight controls in the same sentence ought to give any engineer chills.
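
The notional product above can be sketched numerically. In the Python below, every probability is a made-up placeholder chosen only to show the shape of the argument, and multiplying the terms assumes the links in the chain are independent, which is a simplification. The dictionary keys are hypothetical names for the six factors.

```python
from math import prod


def crash_probability(factors: dict[str, float]) -> float:
    """Multiply the links of the failure chain (assumes they are independent)."""
    for name, p in factors.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability, got {p}")
    return prod(factors.values())


# Placeholder values, NOT real failure rates for any aircraft.
with_cross_check = {
    "aoa_sensor_failure": 1e-4,
    "system_unable_to_recognize_failure": 1e-3,
    "system_unable_to_adapt_to_failure": 1e-2,
    "pilots_unable_to_diagnose_failure": 1e-1,
    "pilots_unable_to_disable_mcas": 1e-1,
    "pilots_unable_to_fly_without_mcas": 1e-2,
}

# With no cross-check, the two system terms become certainties.
without_cross_check = {
    **with_cross_check,
    "system_unable_to_recognize_failure": 1.0,
    "system_unable_to_adapt_to_failure": 1.0,
}

ratio = crash_probability(without_cross_check) / crash_probability(with_cross_check)
print(f"Removing the cross-check raises the notional risk by a factor of {ratio:.0f}")
```

Whatever the real numbers are, the structure is the same: setting two terms to 1 removes two layers of protection and leaves only the pilots between a sensor fault and a crash.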

The next link in our failure chain is the pilots and their ability to recognize, diagnose, and respond to the issue. This implies proper training, procedures, and understanding of the system. From the news coverage, it seems that pilots were not provided sufficient information on the existence of MCAS and how to respond to its failure. Systems and human factors engineers, armed with a hazard analysis, should have known about and addressed this potential contributing factor to reduce the overall risk.

Finally, there’s the ability of the pilots to disable and fly without MCAS. The Ethiopian Airlines crew correctly diagnosed and responded to the issue but the aerodynamic forces apparently prevented them from manually correcting it. The ability to override those forces, plus the time it takes to correct the flight path, should have been part of the FMECA.

I have no specific knowledge of the hazard analyses performed on the 737 Max. Based on recent events, it seems that the risk of this type of failure was severely underestimated or went unaddressed. Either one is equally poor systems engineering.

Cockpit human factors

An inaccurate hazard analysis, though inexcusable, could be an oversight. Compounding that, Boeing made a clear design decision in the cockpit controls which is hard to defend.

In previous 737 models, pilots could quickly override automatic trim control by yanking back on the yoke, similar to disabling cruise control in a car by hitting the brake. This is great human factors and it fit right in with Boeing’s cockpit philosophy of ensuring that the human was always in ultimate control. This function was removed in the Max.

As both the Lion Air and Ethiopian Airlines crews experienced, the aerodynamic forces being fed into the yoke are too strong for the human pilots to overcome. When MCAS directs the nose to go down, the nose goes down. Rather than simply control the airplane, Max pilots first have to disable the automated systems. Comparisons to HAL are not unwarranted.

In summary

Boeing is developing a fix for MCAS. It will include redundancy in AoA sensor inputs, not activating MCAS if the sensors disagree, MCAS activating only once per high-angle indication (i.e. not continuously activating after the pilots have given contrary commands), and limiting the feedback forces into the control yoke so that they aren’t stronger than the pilots. This functionality should have been part of the system to begin with.
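
As a rough illustration of the first three items in that list, here is a Python sketch of an activation gate that cross-checks two sensors and fires only once per high-angle indication. The class name, method name, and both thresholds are hypothetical, chosen for readability. This is not Boeing’s implementation, and it leaves out the limit on control forces entirely.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; not actual aircraft parameters.
DISAGREE_LIMIT_DEG = 5.0
HIGH_AOA_DEG = 12.0


@dataclass
class AugmentationGate:
    """Decides whether one automatic nose-down trim command may be issued."""

    fired_this_event: bool = False

    def should_activate(self, aoa_left_deg: float, aoa_right_deg: float) -> bool:
        # Redundancy: never act when the two sensors disagree.
        if abs(aoa_left_deg - aoa_right_deg) > DISAGREE_LIMIT_DEG:
            return False

        aoa_deg = (aoa_left_deg + aoa_right_deg) / 2
        if aoa_deg < HIGH_AOA_DEG:
            # The high-angle indication has cleared; re-arm for the next one.
            self.fired_this_event = False
            return False

        if self.fired_this_event:
            # Only one activation per high-angle indication.
            return False

        self.fired_this_event = True
        return True


gate = AugmentationGate()
print(gate.should_activate(14.0, 14.5))  # agreeing sensors, high angle
print(gate.should_activate(14.0, 14.5))  # same indication, second request
print(gate.should_activate(22.0, 4.0))   # sensors disagree
```

Even in toy form, the disagreement check is a handful of lines, which supports the point that this protection belonged in the system from the start.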

Along with these fixes, Boeing is likely also re-conducting a complete hazard analysis of MCAS and other flight control systems. Boeing and the FAA should not clear the type until the hazards are completely understood, controlled, quantified, and deemed acceptable.

Many news stories frame the 737 Max crashes in terms of the market and regulatory pressures which resulted in the design. While I don’t disagree, these are not an excuse for the systems engineering failures. The 737 Max is a valuable case study for engineers of all types in any industry, and for systems engineers in high-risk industries in particular.

Visiting an operational missile cruiser

I was recently offered an incredible opportunity to spend a day aboard an operational U.S. Navy ship, meeting the crew and observing their work as they conducted a live fire exercise. The experience blew me away. I came away with new appreciation for our surface forces as well as observations relevant for defense acquisition policy and systems engineering.

Naval Base San Diego

Americans are more disconnected than ever before from their military. To help develop awareness of the Navy’s role, Naval Surface Force Pacific occasionally invites community leaders to visit the fleet in San Diego. I felt very fortunate to be included as one of eight participants in an impressive group including business leaders, community leaders, and a district court judge who created a successful veterans treatment court.

The day began with a tour of San Diego Bay and the many ships docked at Naval Base San Diego. Our guide was Captain Christopher Engdahl, Chief of Staff of the Naval Surface Force Pacific.

White ship with red cross plus two other ships
The hospital ship USNS Mercy (T-AH-19) and two other ships docked at Naval Base San Diego.

As we cruised around the bay, Captain Engdahl described the role of the surface force (surface being distinct from the aviation and submarine forces). Like most of our military forces, the surface force has a diverse mission set. Naval warfare has changed significantly from the large-scale, fleet vs. fleet battles of centuries past.

Recently, the surface force has been emphasizing projection of power, freedom of navigation operations, anti-piracy missions, and humanitarian aid. The surface force also supports air and land operations with forward deployment platforms, fire support, and direct enemy engagement. I haven’t even touched on mines, anti-submarine, electronic warfare, intelligence gathering, and countless other roles.

USS Independence (LCS-2) and USS Comstock (LSD-45) docked at Naval Base San Diego.

As we cruised by the piers, Captain Engdahl exhibited an encyclopedic knowledge of each of the ships we passed. As he spoke, he sprinkled in stories and details gained from an impressive Navy career. He was recently nominated to Rear Admiral and expects to be assigned to the Board of Inspection and Survey, which assesses the condition of Navy ships and reports to Congress.

Speaking of Congress and readiness, acquisition challenges seem to plague the Navy. These issues have been well publicized. Take the woefully over-budget and under-performing Littoral Combat Ships (pictured above is the USS Independence, featuring a trimaran hull). Another example is the futuristic-looking Zumwalt-class (photo below), which saw its initial 32-unit plan cut to just three amid ballooning costs and watered-down capabilities. The Ticonderoga-class cruisers are reaching the end of their service life and the replacement program is both vaguely-defined and on an aggressive timeline; the results remain to be seen. Given this recent track record, it’s hard to imagine the Navy fulfilling its plan to expand the fleet from 285 to 355 ships.

USS Michael Monsoor (DDG-1001), USS Cape St. George (CG-71), and USS Sterett (DDG-104) docked and undergoing maintenance at Naval Base San Diego. The white tarps covering parts of the ships help prevent environmental contamination.

Captain Engdahl touched on these concerns but didn’t dwell on them. He did express concern regarding the military’s struggle with recruitment and retention. Many young Americans don’t meet the physical requirements or don’t view the Navy as a viable career. Top personnel often leave after a few tours to work in industry, which can offer more lucrative compensation and work-life balance. Though the Navy has been making significant strides on retention, personnel will likely remain a perennial issue.

USS Bunker Hill (CG-52)

After our tour of the bay, we headed to Naval Air Station North Island to catch a flight to the USS Bunker Hill (CG-52). Bunker Hill is a Ticonderoga-class guided-missile cruiser. She is assigned to Carrier Strike Group Nine, which includes the aircraft carrier USS Theodore Roosevelt (CVN-71). Though she has reached the end of her service life, the Navy has committed to maintaining her for several more years.

This was evident in the condition of the ship. Bunker Hill had just completed a maintenance and refurbishment period and was in the process of re-certifying the equipment and crew. The flight deck, just barely large enough for the MH-60 Seahawk which was our ride, had been certified the day before. On the day of our visit, the crew was conducting a live fire exercise to certify the ship’s two five-inch guns.

MH-60 helicopter on the tarmac
Our chariot at NAS North Island.

I almost fell over with the rocking of the ship in the ocean swells. Captain Kurt Sellerberg and his executive officer Commander David Sandomir welcomed us aboard. It was lunchtime and we were invited to enjoy burgers with several of the ship’s officers in the wardroom. The food was good, the coffee was strong, and the officers were proud of their ship.

Segment of wooden flight deck in a curio cabinet.
The wardroom contains a cabinet with artifacts related to the ship. This is a section of flight deck from the carrier USS Bunker Hill (CV-17), decommissioned in 1947.

After lunch, we were each assigned a petty officer to guide us around. With the recruitment discussion fresh on my mind, I asked my guide why she chose to enlist in the Navy. She told me that she had been interested in becoming an engineer, but for personal reasons college had not been an option. We didn’t get the opportunity to discuss her future plans (the big guns started firing), but she clearly was the type of dedicated, knowledgeable sailor the Navy wants to retain; if she decides to pursue an engineering degree, she’d be heavily recruited by defense contractors.

In fact, every sailor and officer I met aboard the ship was a model professional. We had the opportunity to tour the medical facilities, the engine and generator rooms, the engineering plant central control station, berthing and hygiene facilities, the torpedo room, helicopter hangar, and firefighting facilities. By interacting with the crew, I gained a sense of the culture onboard, which I would describe as strong camaraderie and trust. I later learned that we were witnessing the first implementation of a new training strategy which allows them to complete their basic certification early and utilize the remaining time for more advanced exercises.

Today, the crew was demonstrating the five-inch guns by firing at imaginary targets on San Clemente Island. We had free range of the upper-level deck and bridge during the live-fire exercise, an almost-unbelievable amount of access. We listened to the radio calls as the forward observer on the island called in firing coordinates and we watched the gun aim and fire in response. The evaluation team on the island recorded data on each round to score the exercise. We witnessed illumination rounds, spotting rounds, and the rapid “fire for effect”.

Between the forward and rear guns and multiple test scenarios, about 150 rounds were fired. You can see a few of them in the video below. All of the exercise objectives were satisfied and Captain Sellerberg came on the 1MC (PA system) to congratulate the crew on a successful exercise.

Heading home

Finally, it was time to head home. The captain and executive officer spent a few more minutes chatting with us while the helicopter landed and refueled. We said farewell and were off.

View of San Diego and the Coronado bridge from the air
Coronado and San Diego from the air.

The flight back provided an opportunity for reflection on the day. Beyond being seriously impressed by the exercise and the crew, there are concrete lessons to be learned:

Engineers need field trips

You can read about the [Air Force/Army/Coast Guard/Marines/Navy/Space Force] ’til the cows come home. But it can’t compare to observing and participating in the culture first-hand. In-person visits to an operational facility build unmatched user empathy and mission understanding. On each project, engineering teams need to take the time to visit their users and spend a few days observing their work; leaders in both the contractor and customer organizations need to support these visits.

Traditions matter

The Navy is proud of its traditions. The ship’s brass bell is rung every half hour and water fountains are called scuttlebutts. Traditions provide continuity, reminding us of our history even as we adapt to the future.

As far as I am aware, there is no psychology research on the effects of tradition on performance. I would venture to guess that tradition is highly correlated with culture and organizational learning in high-risk and high-performing organizations. In this sense, tradition substitutes for shared experience.

In the military, traditions are intentionally-instilled doctrine. In engineering, tradition varies significantly by domain and organization. Engineering is evolving more rapidly than ever, and I think it’s important that we carry forward traditions and institutional knowledge even as we innovate.

Innovate with intention

Bunker Hill may be 35 years old, but you’d be hard-pressed to see signs of her age. Her crew may be young, but you’d be hard-pressed to see signs of immaturity. The Navy relies on centuries of experience with maintaining ships and training new sailors. They know what works and what doesn’t.

Meanwhile, industry gets excited about every hot new buzzword. We breathlessly promote blockchains, machine learning, artificial intelligence, and electromagnetically-launched projectiles. We shoehorn technologies into projects for the sake of innovation and not because it’s what the system really needs. Innovation is essential, but should be done with care and intention, not for novelty’s sake.

Bravo Zulu

I can’t say enough about the men and women I interacted with during this experience. They represent the Navy and our country with dedication, skill, and professionalism. This experience gave me a renewed sense of pride in the work we do in the defense industry. Thanks again to everyone who took the time to share their world with us: you made an indelible impression on me and the entire group.

System lexicons and why your project needs one

A system lexicon is a simple tool which can have a big impact on the success of the system. It aligns terminology among technical teams, the customer, subcontractors, support personnel, and end users. This creates shared understanding and improves consistency. Read on to learn how to implement this powerful tool on your program.

An Engineering Touchstone to Enable Successful Designs

Successful systems are created by engineers who understand and design to the ultimate objectives of the project. When we lose sight of those objectives we start making design decisions based on the wrong criteria and thus create sub-optimal designs. Scope creep, group think, and simple convenience are frequent causes of this type of variation. An effective design assessment tool is a touchstone by which we can evaluate the effectiveness of ongoing design decisions and keep the focus on the optimal solution.

Agile Government Contracts

Agile is a popular and growing software development approach. It promotes a focus on the product rather than the project plan. This model is very attractive for many reasons and teams are adopting it across the defense industry. However, traditional government contracts and project management are entirely plan-driven. Can you really be agile in a plan-driven world?
