Failures and errors happen frequently. A part breaks, an instruction is misunderstood, a rodent chews through a power cord. The issue gets noticed, we respond to correct it, we clean up any impacts, and we’re back in business.
Occasionally, a catastrophic loss occurs. A plane crashes, a patient dies during an operation, an attacker installs ransomware on the network. We often look for a single cause or freak occurrence to explain the incident. Rarely, if ever, are these accurate.
Chapter 10 of the INCOSE Systems Engineering Handbook covers “Specialty Engineering”. Take a look at the table of contents below. It’s a hodge-podge of roles and skillsets with varying scope.
There doesn’t seem to be rhyme or reason to this list of items. Training Needs Analysis is a perfect example. There’s no doubt that it’s important, but it’s one rather specific task and not a field unto itself. If you’re going to include this activity, why not its siblings Manpower Analysis and Personnel Analysis?
On the other hand, some of the items in this chapter are supposedly “integral” to the engineering process. This is belied by the fact that they’re shunted into this separate chapter at the end of the handbook. In practice, too, they’re often organized into a separate specialty engineering group within a project.
This isn’t very effective.
Many of these roles really are integral to systems engineering. Their involvement early on in each relevant process ensures proper planning, awareness, and execution. They can’t make this impact if they’re overlooked, which often happens when they’re organizationally separated from the rest of the systems engineering team. By including them in the specialty engineering section along with genuinely tangential tasks, INCOSE has basically stated that these roles are less important to the success of the project.
The solution is simple: re-evaluate and remove, or at least re-organize, this section of the handbook.
The actual systems engineering roles should be integrated into the rest of the handbook. Most of them already are mentioned throughout the document. The descriptions of each role currently in the specialty engineering section can be moved to the appropriate process section. Human systems integration, for example, might fit into “Technical Management Processes” or “Cross-Cutting Systems Engineering Methods”.
The tangential tasks, such as Training Needs Analysis, should be removed from the handbook altogether. These would be more appropriate as a list of tools and techniques maintained separately online, where it can be updated frequently and cross-referenced with other sources.
Of course, the real impact comes when leaders internalize these changes and organize their programs to effectively integrate these functions. That will come with time and demonstrated success.
The 737 is an excellent airplane with a long history of safe, efficient service. Boeing’s cockpit philosophy of direct pilot control and positive mechanical feedback represents excellent human factors1. In the latest generation, the 737 Max, Boeing added a new component to the flight control system which deviated from this philosophy, resulting in two fatal crashes. This is a case study in the failure of human factors engineering and systems engineering.
The 737 Max and MCAS
You’ve certainly heard of the 737 Max, the fatal crashes in October 2018 and March 2019, and the Maneuvering Characteristics Augmentation System (MCAS) which has been cited as the culprit. Even if you’re already familiar, I highly recommend these two thorough and fascinating articles:
Darryl Campbell at The Verge traces the market pressures and regulatory environment which led to the design of the Max, describes the cockpit activities leading up to each crash, and analyzes the information Boeing provided to pilots.
Gregory Travis at IEEE Spectrum provides a thorough analysis of the technical design failures from the perspective of a software engineer along with an appropriately glib analysis of the business and regulatory environment.
Typically I’d caution against armchair analysis of an aviation incident until the final crash investigation report is in. However, given the availability of information on the design of the 737 Max, I think the engineering failures are clear even as the crash investigations continue.
The most glaring, obvious, and completely inexplicable design choice was a lack of redundancy in the MCAS sensor inputs. Gregory Travis blames “inexperience, hubris, or lack of cultural understanding” on the part of the software team. That certainly seems to be the case, but it’s nowhere near the whole story.
There’s a team whose job it is to understand how the various aspects of the system work together: systems engineering2. One essential job of the systems engineer is to understand all of the possible interactions among system components, how they interact under various conditions, and what happens if any part (or combination of parts) fails. That last part is addressed by hazard analysis techniques such as failure modes, effects, and criticality analysis (FMECA).
The details of risk management may vary among organizations, but the general principles are the same: (1) Identify hazards, (2) categorize by severity and probability, (3) mitigate/control risk as much as practical and to an acceptable level, (4) monitor for any issues. These techniques give the engineering team confidence that the system will be reasonably safe.
On its own, the angle of attack (AoA) sensor is an important but not critical component. The pilots can fly the plane without it, though stall-protection, automatic trim, and autopilot functions won’t work normally, increasing pilot workload. The interaction between the sensor and flight control augmentation system, MCAS in the case of the Max, can be critical. If MCAS uses incorrect AoA information from a faulty sensor, it can push the nose down and cause the plane to lose altitude. If this happens, the pilots must be able to diagnose the situation and respond appropriately. Thus the probability of a crash caused by an AoA failure can be notionally figured as follows:
P(AoA sensor failure) × P(system unable to recognize failure) × P(system unable to adapt to failure) × P(pilots unable to diagnose failure) × P(pilots unable to disable MCAS) × P(pilots unable to safely fly without MCAS)
AoA sensors can fail, but that shouldn’t be much of an issue because the plane has at least two of them and it’s pretty easy for the computers to notice a mismatch between them and also with other sources of attitude data such as inertial navigation systems. Except, of course, that the MCAS didn’t bother to cross-check; the probability of the Max failing to recognize and adapt to a potential AoA sensor failure was 100%. You can see where I’m going with this: the AoA sensor is a single point of failure with a direct path through the MCAS to the flight controls. Single point of failure and flight controls in the same sentence ought to give any engineer chills.
The next link in our failure chain is the pilots and their ability to recognize, diagnose, and respond to the issue. This implies proper training, procedures, and understanding of the system. From the news coverage, it seems that pilots were not provided sufficient information on the existence of MCAS and how to respond to its failure. Systems and human factors engineers, armed with a hazard analysis, should have known about and addressed this potential contributing factor to reduce the overall risk.
Finally, there’s the ability of the pilots to disable and fly without MCAS. The Ethiopian Airlines crew correctly diagnosed and responded to the issue but the aerodynamic forces apparently prevented them from manually correcting it. The ability to override those forces, plus the time it takes to correct the flight path, should have been part of the FMECA analysis.
I have no specific knowledge of the hazard analyses performed on the 737 Max. Based on recent events, it seems that the risk of this type of failure was severely underestimated or went unaddressed. Either one is equally poor systems engineering.
Cockpit human factors
An inaccurate hazard analysis, though inexcusable, could be an oversight. Compounding that, Boeing made a clear design decision in the cockpit controls which is hard to defend.
In previous 737 models, pilots could quickly override automatic trim control by yanking back on the yoke, similar to disabling cruise control in a car by hitting the brake. This is great human factors and it fit right in with Boeing’s cockpit philosophy of ensuring that the human was always in ultimate control. This function was removed in the Max.
As both the Lion Air and Ethiopian Airlines crew experienced, the aerodynamic forces being fed into the yoke are too strong for the human pilots to overcome. When MCAS directs the nose to go down, the nose goes down. Rather than simply control the airplane, Max pilots first have to disable the automated systems. Comparisons to HAL are not unwarranted.
Boeing is developing a fix for MCAS. It will include redundancy in AoA sensor inputs, not activating MCAS if the sensors disagree, MCAS activating only once per high-angle indication (i.e. not continuously activating after the pilots have given contrary commands), and limiting the feedback forces into the control yoke so that they aren’t stronger than the pilots. This functionality should have been part of the system to begin with.
Along with these fixes, Boeing is likely3 also re-conducting a complete hazard analysis of MCAS and other flight control systems. Boeing and the FAA should not clear the type until the hazards are completely understood, controlled, quantified, and deemed acceptable.
Many news stories frame the 737 Max crashes in terms of the market and regulatory pressures which resulted in the design. While I don’t disagree, these are not an excuse for the systems engineering failures. The 737 Max is a valuable case study for engineers of all types in any industry, and for systems engineers in high-risk industries in particular.
A system lexicon is a simple tool which can have a big impact on the success of the system. It aligns terminology among technical teams, the customer, subcontractors, support personnel, and end users. This creates shared understanding and improves consistency. Read on to learn how to implement this powerful tool on your program.
Successful systems are created by engineers who understand and design to the ultimate objectives of the project. When we lose sight of those objectives we start making design decisions based on the wrong criteria and thus create sub-optimal designs. Scope creep, group think, and simple convenience are frequent causes of this type of variation. An effective design assessment tool is a touchstone by which we can evaluate the effectiveness of ongoing design decisions and keep the focus on the optimal solution.
The application of human systems integration (HSI) throughout a project results in improved system performance, reduced lifecycle cost, reduced development risk, and no increase in development cost when executed effectively.
The title of this post is slightly misleading. The Department of Defense (DoD) doesn’t just influence systems engineering practice, it practically dictates it. And the reason is simple: The DoD acquires more types of systems and more complicated systems than any other organization in the world1Read More
Posts on this site, including text and images, are copyrighted. Background and site images are either in the public domain or used under a Creative Commons license.