Reliability Engineering and System Safety 74 (2001) 345-352
www.elsevier.com/locate/ress

Probabilistic risk analysis for the NASA space shuttle: a brief history and current work

Elisabeth Paté-Cornell a,*, Robin Dillon b

a Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, USA
b Department of Management Science and Information Technology, Pamplin College of Business, Virginia Tech, Falls Church, VA 22043, USA

Received 2 June 2000; accepted 13 December 2000

* Corresponding author. E-mail addresses: mep@leland.stanford.edu (E. Paté-Cornell), dillon@vt.edu (R. Dillon).

Abstract

While NASA managers have always relied on risk analysis tools for the development and maintenance of space projects, quantitative and especially probabilistic techniques have been gaining acceptance in recent years. In some cases, the studies have been required, for example, to launch the Galileo spacecraft with plutonium fuel, but these successful applications have helped to demonstrate the benefits of these tools. This paper reviews the history of probabilistic risk analysis (PRA) by NASA for the space shuttle program and discusses the status of the ongoing development of the Quantitative Risk Assessment System (QRAS) software that performs PRA. The goal is to have within NASA a tool that can be used when needed to update previous risk estimates and to assess the benefits of possible upgrades to the system. © 2001 Elsevier Science Ltd. All rights reserved.

Keywords: Probabilistic risk analysis; space shuttle

0951-8320/01/$ - see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0951-8320(01)00081-3

1. Introduction

For a long time, NASA managers have viewed probabilistic risk analysis (PRA) and expected-utility decision analysis with some suspicion, and many still do. The agency often operates new or improved systems, and there is seldom an abundance of statistics to describe past performance. Probabilistic risk analysis, however, is most useful when little statistical data are available to assess the failure probability of a whole system. In these cases, it is often useful to decompose the system into subsystems and components to quantify the overall failure risk as a function of the system's architecture and of the probabilities of failure of the different elements for which more data are generally available.

At the onset of the Apollo program, NASA seemed to have accepted the notion that quantitative risk analysis could be useful for decision support [1]. But the failure probabilities computed for some missions of the Apollo program were largely overestimated because they were based on conservative estimates of subsystem failure risks. Because the results were so pessimistic and showed such a small probability of mission success, NASA at that time turned away from quantitative risk assessment methods. A few years later, however, failure probabilities such as 10^-5 or 10^-6 per flight were often casually quoted without much justification, either for subsystems or for entire missions, to express and support NASA's confidence in their performances [2].

Rather than quantifying failure probabilities, the agency has generally preferred qualitative analyses such as Failure Mode and Effect Analysis (FMEA), Critical Item Lists (CILs), and Risk Matrices [3-5]. FMEA/CIL relies on the logical identification of a system's weak points and of failure/event combinations (cut-sets) leading to its catastrophic failure. Risk matrices usually include, for different components or subsystems, qualitative information and corresponding scale indices about the likelihood of failure events (e.g., high, medium, or low) and the severity of their consequences (e.g., high, medium, or low). These matrices are often used as filters to decide which are the highest priority technical problems. A major difficulty when using risk matrices is to combine such information about the different components to characterize the robustness of the whole system.

Since the Challenger accident, however, the use of PRA at NASA has increased significantly, not only for the space shuttle but also for some unmanned space missions and for the space station [6-17]. Probabilistic models have been and are being developed to assess the risk contribution of specific shuttle problems and the benefits of shuttle upgrades. These models are currently implemented through the software called QRAS (Quantitative Risk Assessment System), itself in its development phase. Therefore, the current shuttle PRA model is both a motivation and a product of this development. The goal is to have within NASA a tool that can be used when needed to update previous risk estimates and to assess the benefits of possible upgrades.

This paper describes briefly the history of shuttle PRA and the current efforts to develop a complete risk analysis model at the same time as the QRAS software. It is based in part on studies such as Fragola [1] and on the results of a recent panel review of current NASA PRA work for the shuttle [18].

2. PRA for NASA space shuttle: a brief history

2.1. Prior to the Challenger accident

As mentioned above, at the onset of the Apollo program, NASA generally accepted the notion of using risk analysis to choose some mission features and to compare the results to safety benchmarks [1]; but during the Apollo program, pessimistic risk estimates discouraged the agency from adopting quantitative risk analysis. As is often the case, the problem was that conservative values (as opposed to means) of future failure frequencies were used to account for uncertainties, instead of a full uncertainty analysis. In truth, the methods were in their infancy and the software needed did not exist. The conservative approach seemed prudent but the results were both alarming and discouraging. As usual when it is clear that they contradict the facts, wrong results became detrimental to the acceptance of the quantitative risk analysis method itself.

Furthermore, notions of failure risk and failure probability often clashed (and still do) with the engineering culture, primarily based on safety factors, which can be defined in many different ways, and include, for example, design for higher loads than anticipated. Useful as they may be in decreasing failure risks, the problem with safety factors is that they are not directly linked to the probability of system failure and the relationship may vary across systems. For example, a safety factor of two in one system may not imply the same safety level as in another. Therefore, safety factors alone are insufficient to support cost-effective allocation of upgrade resources and prioritization of retrofits across different systems.

Rather than quantitative risk estimates, shuttle program managers preferred the use of Failure Modes and Effects Analysis and Critical Items List (FMEA/CIL) to identify a system's weak points and to manage failure risks accordingly. FMEA is useful to the extent that it indicates which possible design changes might eliminate a failure mode, reduce its future frequency to a lower level, or mitigate its consequences. All single-point failures that cannot be eliminated are collected in a Critical Items List. The process by which this list is established is a static, qualitative, bottom-up approach, focused on the identification and reduction of the risk of single, independent component failures that can cause the loss of crew, vehicle, or mission [1,19-22]. In the critical item list, components are listed according to their level of criticality. Criticality 1 characterizes an element whose failure is sufficient to cause overall shuttle failure defined by loss of vehicle and crew (LOVC). Criticality 1R is when there is one redundancy to prevent LOVC.

The problem is that the level of criticality alone is not an indicator of the contribution of a particular component to the overall failure risk. A Criticality 1 component with a low failure probability can be less threatening to a system's safety than several components in parallel, each with a high probability of failure, especially if these failures are highly dependent. Regardless of its shortcomings, FMEA is a first step towards a PRA and the complete description of all accident sequences. The next step is the assessment of the probability of system failure per time unit or per operation as a function of the probabilities of relevant events including component failures, accounting for dependencies and, in particular, those caused by external events.

In addition to qualitative studies, some quantitative risk analyses had been performed for the space shuttle before the Challenger accident during flight 51L in January 1986 (see, for example, Baker [23]). These were small-budget studies, highly constrained and limited by the definition of their scope and by restrictions on data sources. NASA mandated that in addition to the limited data available from past performances, the analysts should use as data the opinions of NASA's experts regarding the performance of specific systems. These experts, for example, estimated the probability of a Solid Rocket Booster (SRB) failure at 10^-5 per flight without any formal systems analysis needed to support such an estimate. It is on that basis that early risk analyses for the shuttle system indicated a probability of LOVC in the order of one in several thousand per flight. Consequently, this time, they significantly underestimated the failure risk. Yet, the astronauts knew from their experience with minor failures in flight (e.g., of a particular switch in the crew cabin) that the risks were probably much higher. Feynman, in his Appendix to the Rogers Commission report [24], emphasized the problems with such estimates and the downside of overconfidence. Indeed, Kaplan [25] showed that a Bayesian analysis based on the shuttle 'near-misses' prior to the Challenger accident could have indicated a much higher failure probability.

2.2. Post-Challenger risk analyses

The Challenger accident drastically altered this optimism. NASA and its contractors performed a major review of all shuttle FMEAs and updated the Critical Item List. The immediate result was a large increase in the number of critical items from 2,369 to 4,686 [26]. In addition, the Rogers Commission [24] that investigated the accident concluded that the perceptions of shuttle failure risks had been overly optimistic and that the risk assessment methods needed improvement. Furthermore, priorities needed to be set among risk mitigation measures given the limitations of NASA's resources. Qualitative methods that had provided general guidance for risk management were inadequate for prioritization because they did not allow quantification of the relative contributions of the different components to the probabilities of system failure.

At about the same time, a National Research Council (NRC) panel reviewed the risk assessment and management of the space shuttle program, and recommended quantitative approaches to set priorities among possible upgrades of critical items. The NRC panel found that previous quantifications of the shuttle risks were based almost exclusively on subjective judgments and qualitative rationales, even though quantitative engineering analyses and test data relevant to risk assessment were available and could have been used [27].

In the late 1980s and early 1990s, probabilistic risk analysis (PRA) therefore seemed a better alternative to qualitative risk assessment. Yet, within NASA, there was still strong resistance. First, the cost of a complete PRA seemed high. The value of information as decision support was not well understood, and it was sometimes stated that instead, the same amount of money could be better invested in strengthening the system. The question of course is: where should that investment be made in priority, and how much will eventually be gained by replacing intuition by quantitative decision support? Second, the use of Bayesian probability, which is often the only option given the systems' novelty, was often considered at NASA too 'subjective' to be trusted for decision support. That was true until it became obvious that there was no better alternative because, by definition, extensive data sets did not exist. At the very least, these methods allowed a systematic and consistent assessment and treatment of the risk components.

In the following years (1990s), a number of pilot studies and a first attempt at a comprehensive shuttle PRA study were undertaken. In an initial attempt to incorporate probabilistic risk analysis methods in its decision support, NASA commissioned two 'proof-of-concept' studies. Their objective was to determine if a PRA could identify high-risk areas that traditional FMEA/CIL and hazard analysis techniques could not. One of these studies focused on the auxiliary power units (APUs), and the other on the main propulsion pressurization system [28,29]. These two studies showed that the probabilities of failure of a small number of CIL items represented most of the shuttle failure risk, and that in addition, several important failure scenarios had not been identified by NASA's previous analyses. Furthermore, the APU study demonstrated that the number of redundancies had to be weighed against the increase of risk of fire and explosion caused by the possibility of a hydrazine leakage (reference to original study, see Fragola [1]). These studies also showed that components of criticality higher than 1 (i.e., less critical to LOVC) could contribute as much as 30% of the overall failure risk. This implies that limiting a PRA to Criticality 1 items is likely to lead to an underestimation of the risk.

Shortly after the completion of these proof-of-concept studies, NASA was required to fund an independent shuttle PRA to support the approval of the launch of the Galileo mission because plutonium fuel was present on the spacecraft [30]. This study was limited to the ascent portion of the mission and focused primarily on scenarios that presented a risk to the nuclear payload. It included a propagation of uncertainties about the future frequencies of component failures, thus providing a probability distribution for the future frequency of a shuttle accident or flight abort. Despite NASA's conclusion that the probability of shuttle failure could be high (median estimate 1/78), this study concluded that the risk to the public caused by plutonium contamination was low.

Prior to approval of the launch of the Ulysses spacecraft from the shuttle, the shuttle risk analysis that was required for the Galileo mission needed updating to consider any variations associated with this spacecraft since the previous study. In 1993, NASA thus commissioned an update of the Galileo study, using Bayesian techniques to integrate the former risk estimates with the new evidence that had been gathered since the original report [31].

Around the same time (the early 1990s), damage to several of the tiles of the shuttle heat shield during previous missions prompted NASA's management to commission a study of the thermal protection system (TPS). This study showed that the contribution of the black tiles that protect the underside of the shuttle orbiter at re-entry was about 10% of the overall shuttle failure risk [32]. Only two tiles had failed in flight thus far, without causing damage to the orbiter's skin. One failed because of a weak bond, and the other because it was hit by a piece of debris probably coming from the insulation of the external tank. That study showed that 15% of the tiles were the source of 85% of the probability of a shuttle accident induced by TPS failure. More importantly, on the basis of a simple first-order risk analysis, it allowed ranking the tiles by order of risk contribution, and therefore, setting priorities in the tile inspection before each flight [32,33].

Then, in 1995, NASA funded the first attempt at a comprehensive quantitative risk assessment including all phases of a shuttle mission [34]. The method used was similar to the PRA framework developed by the US Nuclear Regulatory Commission [35-40] (Master Logic Diagram, fault tree and event tree analyses, etc.) to obtain the probability of a major accident as a function of the probabilities of component and subsystem failures [41]. Because of resource limitations, however, a number of components were assumed to contribute negligible additional risks and were not included in the analysis. In addition, some external accident initiators were left out, for example, the penetration by micrometeoroids of systems other than the tiles. The results showed a probability of a shuttle accident (LOVC) between 1/76 and 1/230, which seemed consistent with the limited experience available at that time. Following that study, NASA managers (as the Nuclear Regulatory Commission had done a few years before) decided to use PRA as one of the bases for the support of decisions regarding improvements in shuttle safety. They needed a tool to routinely perform shuttle PRAs, which had to be updated regularly to monitor risk variations and to evaluate the effects of changes in design and operation procedures. This is the effort that is now underway and which we describe further.

Many lessons were already learned in these early PRAs; for example:

1. Conservative estimates should not be mixed with probabilities that represent (for instance) mean future frequencies of failures. Otherwise the results are meaningless and possibly counterproductive.
2. Guessing the probability of failure of a complex system such as the SRBs is unlikely to lead to an accurate figure when the system can be analyzed to provide a better result.
3. Near-miss events and partial failures can provide valuable information for the assessment of system failure risk, especially when a catastrophic failure has not yet happened.
4. Restricting a PRA to Criticality 1 items is likely to lead to an underestimation of the failure risks.
5. Adding redundancies does not always improve the safety of the system (APUs, for example, introduce an added risk of hydrazine leakage that has to be weighed against the value of an extra redundancy).
6. A top-down analysis is needed to capture the dependencies among system failures, for example between the debonding of debris from the insulation of the external tank and their effects on the tiles of the heat shield.

2.3. The current work on shuttle PRA and the QRAS software

The studies mentioned above were all completed by independent consultants outside of NASA. In July 1996, the NASA administrator requested that an independent quantitative analysis of the risk of a shuttle accident be conducted by internal NASA experts, and that supporting software be developed. The long-term objective is to use the results as decision support for shuttle upgrades. The chosen approach is to develop the Quantitative Risk Assessment System (QRAS) software to perform PRA, permit its updating, and allow real-time support of decisions ranging from retrofit to launch under specified circumstances. This ongoing study involves, in parallel, the improvement of the first version of the QRAS software and the performance of a PRA. It will result in a model that will provide an overall shuttle failure probability and will allow estimation of the risk changes associated with proposed shuttle upgrades (e.g., an upgrade of the main engine turbopumps).

Two separate teams are currently developing these risk analysis models. The first one, at Johnson Space Center (JSC), analyzes the orbiter and its main propulsion systems including the auxiliary power systems, hydraulic system, thrust vector control, and main propulsion system. The second team, at Marshall Space Flight Center (MSFC), is in charge of the other shuttle subsystems: the main engines, the external tank, the solid rocket boosters, and the reusable solid rocket motors. These studies are designed to be limited to Criticality 1 and 1R items, and generally assume that failures of such items inevitably lead to an accident or mission failure. For analytical purposes, the system, as well as the PRA, has been divided into 'modules'. The analysis is being done 'bottom-up' on the basis of these modules. Some links have been included to ensure that accident sequences that cut across modules, and across analytical teams, are accounted for. Yet, the current exercise is facing some of the classical difficulties of coordinating a PRA study when the system has been divided for analytical purposes. Both teams are supposed to rely on QRAS while it is still in its development phase. Therefore, at this stage, some elements of the PRA (e.g., fault tree analysis results, especially for the analysis of the orbiter) are computed 'off-line' independently from the existing software. A panel of experts recently reviewed the current shuttle PRA efforts [18]. Some of the comments of this panel are described below.

The computer software QRAS originated at NASA Headquarters in conjunction with the University of Maryland in 1998 [42-44]. It is currently being developed by NASA Headquarters and its contractors and subcontractors, including Allied Signal and L&M Technology. QRAS aggregates subsystem failure mode probabilities from the bottom up to produce intermediate and top-level catastrophic failure probabilities and bounds on the uncertainties. It is based on the identification of a set of scenarios represented by event sequence diagrams (ESDs), starting with an initiating event and ending with an accident, a flight abort, or a benign outcome, either directly or through a sequence of intermediate ('pivotal') events. Among the results is a prioritization of the subsystem failure modes that contribute most to the overall risk, and an evaluation (and ranking) of space shuttle potential upgrades, both from a safety and a cost point of view [20]. The first version of this software is currently used at JSC to assess the failure risks of the shuttle orbiter, and at MSFC to assess the failure risks of the other shuttle subsystems.

Once the QRAS software is completed and available, NASA will be able to develop and upgrade PRA models at very detailed levels, integrating physical models of failure processes into the logic model and the probabilistic analysis. For the moment, however, the two space centers that are charged with the development of PRA models using the first version of the QRAS software have had mixed experiences with it because it still lacks important features. For example, in its current version, QRAS treats accident sequences independently from others even though some may have common events. It does not include proper treatment of external events and common causes of failures, and it does not have the capability of building and analyzing fault trees. Therefore, some of the current PRA work has to be done off-line, for instance using fault tree analyses that are not part of the current software, before integrating the results in the QRAS models. A consortium of industry contractors, the United Space Alliance (USA), has been charged with the space shuttle operations and is monitoring the shuttle PRA work done both at JSC and MSFC. Its objective is to use the results to support recommendations for upgrades of the shuttle design as well as improvements of maintenance and processing operations.

3. Some characteristics of the current PRA modeling efforts for the space shuttle

In a recent review of the space shuttle PRA, a panel of experts [18] concluded first and foremost that the PRA models currently developed by NASA were an important step towards improving the risk management process. It is essential at this stage that NASA adopt current risk analysis methods to be able to improve its systems in a cost-effective way. Yet, it was also found that the current models exhibited a number of characteristics that left space for improvement.

3.1. The effect of organizational dispersion

The coordination and communication among the teams that perform the shuttle PRAs at JSC and MSFC may not be sufficient. For example, the two groups use different 'ground rules' and assumptions, possibly because they interpreted NASA's initial directions differently. The studies were to be limited to the most significant of Criticality 1 items. In addition, a common assumption was that failure of these items inevitably leads to a system failure. The first question is to choose the items to be included in the analysis, and the two teams adopted different procedures to choose the events included in their models, the level of detail of their studies, and the treatment of quantitative data. Therefore, the results obtained in the two centers are not directly comparable at this stage. In addition, when a system is divided at the onset of a PRA without an overarching model to ensure completeness and consistency, issues can surface in the treatment of dependencies across subsystems, common causes of failures, and performance of the interfaces.

The problem of dispersion of work across centers with insufficient communications is a common one that had already been identified in the past as one of the safety problems of the shuttle system. For example, after the Challenger accident, the Rogers Commission [24] showed that poor communications were at the source of the fatal decision to launch on that day. In the same way, during the earlier analysis of the tiles' contribution to failure risks, it was shown that part of the risk of tile failure could be attributed to debris hits caused by the debonding of parts of the insulation of the external tank [32]. Yet, the two systems are managed independently, the external tank at Marshall Space Flight Center and the tiles at Kennedy Space Center, and it took the chemical analysis of a missing tile's cavity before the link was established.

3.2. Analytical modules as opposed to an overarching model

The role of an overarching model is to ensure the completeness and the accuracy of a PRA, the inclusion of dependencies across systems and of common events across accident sequences, and the proper treatment of external events that can affect several subsystems simultaneously. This requires a top-down approach starting from a systematic analysis of accident sequences, or conjunctions of events leading to failure. An overarching model can be based on different tools such as the Master Logic Diagram developed and used in the nuclear power industry, a complete event tree, or an influence diagram. Influence diagrams are particularly helpful because they can process both probabilistic dependencies and also deterministic functions such as the Boolean analysis involved in fault trees. They also provide a graphical display of interdependencies among events. What is important, in any case, is less the nature of the tool itself than the completeness of the set of scenarios and the analysis of dependencies that are included in the PRA.

The PRA models for the different parts of the shuttle are currently constructed mostly bottom-up. The system has been divided into modules that are then analyzed. This structure has no doubt facilitated the division of work, and some accident sequences that cut across modules have been included. The decomposition of the system, however, is generally one of the steps of the analysis, based on logic and, if resources are constrained, on the value of information of further decomposition. The definition of modules as a starting point in the analysis can lead to missing failure dependencies and commonality of elements among accident sequences. Therefore, it can hide the true risk contribution of an element that affects several modules if no integration mechanism permits assessing the role of this component across the system.

3.3. A simplified approach to consistency in the level of analytical depth and detail

It is generally impossible to include all components and all event scenarios in a PRA, and an adapted screening procedure is necessary. This screening procedure is meant to filter out the scenarios that are low contributors to the overall risk while retaining the important ones. For simplicity, the current PRAs for the space shuttle are limited to failure scenarios involving Criticality 1 and 1R items, and among these scenarios, a decision is made a priori as to which ones are sufficient risk contributors to be included in the analysis.

As mentioned above, the criticality level is only loosely coupled to an item's contribution to the probability of failure, and it was shown for the shuttle that items of Criticality 2 and above were significant risk contributors. Therefore, the simplicity of this choice probably leads to excluding some risky items that are possibly more dangerous than some that were included. More importantly, perhaps, it might eliminate from the ranking of upgrades some improvements that could be more cost-effective than those considered. This choice, by itself, would lead to an underestimation of the overall probability of failure.

The definition of Criticality 1 items does not imply that their failure inevitably leads to systems failure, only that it can cause an accident. Several intermediate ('pivotal') events can occur following such a failure. Some sequences (or conjunctions) can lead to an accident, others to a safe flight abort or to a correction by human intervention that permits completion of the mission. Yet, in order perhaps to produce conservative results or to balance the exclusion of other components, it is generally assumed in the current studies that the occurrence of an initiating event of Criticality 1 inevitably leads to an accident. Therefore, this time, the simplifying assumption may lead to an overestimation of the consequences of a Criticality 1 event. It may be that the choice of Criticality 1 items (and only of some of them) compensates for the assumption that they lead to system failure; but it is impossible to tell without further information whether the overall results reflect an overestimation or underestimation of the failure risk.

The definition of initiating events is only a starting point. The choice of analytical depth and of adequate level of detail in the different parts of a system is critical to ensure first the best use of the resources spent for the analysis, and second, the consistency of results across subsystems. The analytical depth in the current shuttle PRAs is simply determined by the hierarchy of components and subsystems. A certain form of consistency has thus been obtained.

Alternatively, consistency in analytical depth could be based on the value of the additional information that one might expect from pursuing the analysis further down in some parts of the system. Therefore, another rule could be to stop the analysis when it does not bring additional information that is likely to make a difference in the results and in the decisions that they support [45,46]. When there are sufficient failure data at a subsystem level, it does not need to be analyzed further, unless one seeks to evaluate the contribution of one of its components to the overall failure probability. By contrast, when there is little statistical evidence about a subsystem and when data are available at the component level only, the analysis has to be done at the component level. Sometimes, it may even be necessary to go further and decompose the failure of a component into its failure modes (for example, a valve can be stuck open or closed). One can then take advantage of the powers of Bayesian treatment of the evidence and of systems analysis to aggregate the risk contribution of the different elements. Analytical judgment and flexibility are thus required to ensure that the main risk contributors are identified and included in the analysis, down to the same level of risk contribution. This choice may or may not correspond to the classical hierarchy of subsystems and components. A simplified uniform approach to analytical depth can be ineffective because a detailed analysis at the component level can be useful in some places but unnecessary in others.

3.4. Human decisions and action in a risk analysis

Human decisions and actions are key factors in system failure risks. Yet, they are sometimes ignored or poorly treated. It is important to note that they can include not only catastrophic errors, but also operators' actions to correct a dangerous condition.

Errors can occur in manufacturing, system assembly, inspection and maintenance, and operation (mission). When these errors are already included in the database used for risk estimation, they are de facto included in the analysis and do not need to be addressed further. Existing statistical data, however, may not include rare errors that can have catastrophic consequences. It seems that in the PRAs that are currently performed by NASA, there is no analysis of process errors that can affect the different subsystems. Errors of this type, for instance, were analyzed and included in the 1993 study (mentioned above) of the LOVC risks due to failure of the tiles [32,33]. These errors included, for example, failure to center a tile in its cavity during maintenance operations, or letting the bond dry before applying pressure. Both can significantly reduce the strength of the bond, causing a tile to debond in operation and leaving the aluminum skin exposed to heat loads at re-entry.

Errors can also affect the operations phase. Yet, there seems to be an implicit assumption in the NASA studies that astronauts make no mistakes (with the possible exception of an error at landing). Clearly, this apparent omission of human errors tends to underestimate the risk of catastrophic failures.

But there is a positive side: human intervention can reduce the risks of an accident or stop the propagation of an accident sequence. Again, it may be that skilled interventions compensate for the possibility of human error, but it is impossible to determine without further information the net effect of these two omissions on the overall failure risk.

3.5. Mixed methods in data analysis

A risk analysis is most useful when there are few statistical data of different nature and from different sources such that the situation requires Bayesian treatment of the evidence. The frequentist approach to classical statistics has the advantage of being commonly used but requires a large amount of information. Furthermore, the confidence levels that qualify the results are difficult to interpret as characteristics of uncertainties. The Bayesian analysis is more powerful in this respect and does not require a specified amount of data (i.e., the quantity of information is reflected in the results of the uncertainty analysis). But it requires the use of prior probabilities (e.g., uniform distributions may be used to reflect complete ignorance when that

develop its PRA models to the point where they can be used, along with other types of information, to make safety decisions. Given the unique nature of its systems, NASA will probably need to go through a similar exercise and the use of quantitative risk analysis will be of great value in assisting decisions in all phases of space systems life, from design, processing, operations and upgrading. It is important, however, that fundamental issues be recognized and resolved quickly.
is truly the case), which injects an element of subjectivity in the analysis. In the case of space systems, one generally does not have the choice because ¯ight data are rare by 4. Conclusions de®nition. Yet many different types of data can be used as input to a risk study. The Probabilistic Risk Analysis method has gone through The current PRAs for the space shuttle often use both ups and downs at NASA. From the hopes of the early times frequentist and Bayesian analyses (hence possible inconsis- of the Apollo program, to the disappointments of pessimis- tencies), but not always all available information. Possible tic then optimistic results (and were wrong in both cases), it data include test data, ¯ight data, surrogate data and expert is slowly being improved and incorporated in the NASA opinions when appropriate. Therefore, the results could thinking about risk ranking and prioritization of upgrades. probably be improved by adopting consistently a Bayesian Where it is resisted, it is often because it runs against the approach, using all existing data. For example, surrogate engineering tradition of safety factors and suspicion about data can be used as priors to be updated based on additional the use of Bayesian probability. Yet, if one wants to assess experience and new ¯ight data. In any case, failure prob- the risk, the Bayesian approach is unavoidable because there abilities must be assessed differently if they represent are seldom enough data for a classical statistical analysis. marginal or conditional probabilities, in which case the PRA has now been adopted as one of the decision supports events on which they depend must be considered. for the management of the space shuttle, of the space station Finally, in the current PRAs, the simplifying choice was and of some unmanned space missions. In the long term, this made to compute ®rst-order probabilities only. 
The results decision will improve the consistency and the ef®ciency of are thus represented by the probability (or mean future the management of NASA's space systems. As usual, in this frequence) of different potential system states, based on early phase of the PRA modeling, several problems still the probabilities of different hypotheses or models, and of need to be addressed. Some of them are essentially organi- parameter values given these different models [47,48]. In zational (i.e., the work is divided among several space contrast, in a second-order uncertainty analysis, the uncer- centers). But the most important issues are the need for an tainties about the possible underlying hypotheses or models overarching model and fundamental consistency in the are propagated throughout the analysis. The results are choice of method of problem structure, analytical depth probability distributions of the probabilities (or future and treatment of data. As always, the value of the analysis frequencies) of different system and subsystem states. A will be determined by the use of the information that it ®rst-order probability analysis is suf®cient to set priorities provides, and NASA should realize an improvement in deci- when the ranking criterion is the mean future frequency. sion-making based on quantitative assessment rather than Yet, an assessment of uncertainties in the input (i.e., failure intuition and guesses at the system level. frequency distributions for the basic components) and consistent propagation of these uncertainties in the analysis References could permit, in addition, an assessment of the effects of uncertainties on the results and on priorities. [1] Fragola JR. Risk Management in US Manned Spacecraft: From Many of these simplifying assumptions will be unneces- Apollo to Alpha and Beyond. Proceedings of ESA Product Assurance sary after the completion of the QRAS software. 
QRAS is Symposium and Software Product Assurance Workshop, Noordwijk, currently being updated to eventually involve features such Netherlands, March 19±22, 1996 as fault tree analysis, an overarching model, external events, [2] Feynman R. Personal Observations on the Reliability of the Shuttle, Appendix IIF. In: Rogers, et al., 1986. human errors and adequate Bayesian treatment of all avail- [3] Bowles JB. The New SAE FMECA Standard. Proceedings of the able information. The current experience is probably a Annual Reliability and Maintainability Symposium 1998:48±53. necessary step towards the realization that such features [4] Little®eld ML. FMEA/CIL Implementation for the Space Shuttle (among others) are needed to provide results that are cred- New Turbopumps. Proceedings of the Annual Reliability and Main- ible in absolute terms, and in relative terms, permit ranking tainability Symposium 1996:48±52. [5] Onodera K. Effective Techniques of FMEA at Each Life-Cycle Stage. of upgrades by order of cost-effectiveness. Proceedings of the Annual Reliability and Maintainability Sympo- NASA's experience in this respect is not unique. The US sium 1997:50±6. Nuclear Regulatory Commission has taken a long time to [6] Agarwala AS. Reliability Engineering in Defense and Aerospace ± A 352 E. PateÂ-Cornell, R. Dillon / Reliability Engineering and System Safety 74 (2001) 345±352 Transition to the Commercial World. Communications in Reliability, [27] National Research Council (NRC). Post-Challenger Evaluation of Maintainability, and Supportability 1994;1(1):14±9. Space Shuttle Risk Assessment and Management. Committee on [7] Davison M, Vantine WL. Understanding Risk Management: A Shuttle Criticality Review and Hazard Analysis Audit of the Aero- Review of the Literature and Industry Practice. 
European Space nautics and Space Engineering Board, National Academy of Sciences, Agency Risk Management Workshop, ESTEC, March 30±April 2, National Research Council, National Academy Press, Washington, 1998:253±6. DC, January, 1988. [8] Frank M. A Survey of Risk Assessment Methods from the Nuclear, [28] Slay, et al. Space Shuttle Risk Assessment Proof-of-Concept Study, Chemical, and Aerospace Industries for Applicability to the Priva- Auxiliary Power Unit and Hydraulic Power Unit Analysis Report. tized Vitri®cation of Hanford Tank Wastes. Report to the Nuclear McDonnell Douglas Corp., December 18, 1987. Regulatory Commission, August, 1998. [29] Plistiras J., et al. Space Shuttle Main Propulsion Pressurization [9] Frank M. Assessment of the Cassini Mission Nuclear Risk with Alea- System Probabilistic Risk Assessment, Final Report. Lockheed tory and Epistemic Uncertainties. Proceedings of the 4th International Corporation, Palo Alto, CA, 1988. Conference on Probabilistic Safety Assessment and Management. [30] Buchbinder B. Independent Assessment of Shuttle Accident Scenario September 13±18, 1998. Probabilities for the Galileo Mission, Volume 1. NASA/HQ Code QS, [10] Frank M. Personal correspondence describing NASA project work. Washington DC, 20546, April, 1989. October, 1998. [31] SAIC. Probabilistic Risk Assessment of the Space Shuttle Phase 1: [11] Guarro S, Bream B, Rudolph LK, Mulvihill RJ. The Cassini mission Space Shuttle Catastrophic Failure Frequency Final Report, 1993. risk assessment framework and application techniques. Reliability [32] PateÂ-Cornell ME, Fischbeck PS. Probabilistic risk analysis and risk- Engineering and System Safety 1995;49:293±302. based priority scale for the tiles of the space shuttle. Reliability Engi- [12] Jet Propulsion Laboratory (JPL). Cassini Recerti®cation Review, JPL neering and System Safety 1993;41:221±38. Internal Document D-11715, 2. Pasadena, California: Jet Propulsion [33] PateÂ-Cornell ME, Fischbeck PS. 
PRA as a management tool: organi- Laboratory, 1994. zational factors and risk-based priorities for the maintenance of the [13] Miles R. Personal correspondence describing NASA project work, tiles of the space shuttle orbiter. Reliability Engineering and System July, 1998. Safety 1993;41:239±57. [14] Mulvihill RJ. Personal correspondence describing NASA project [34] SAIC. Probabilistic Risk Assessment of the Space Shuttle, 1995. work, July, 1999. [35] U.S. Nuclear Regulatory Commission (USNRC). Reactor Safety [15] Railsback J. Personal correspondence describing NASA project work, Study: Assessment of Accident Risk in U.S. Commercial Nuclear July, 1998. Plants, WASH-1400 (NUREG-75/014). Washington, DC: U.S. [16] Shemanski T, Silke K. Reliability Growth Model Overview. Relia- Nuclear Regulatory Commission, 1975. bility Bulletin 92-02, General Dynamics Space Systems Division, [36] U.S. Nuclear Regulatory Commission (USNRC). PRA Procedures 1992. Guide, NUREG/CR-2300. Washington DC: U.S. Nuclear Regulatory [17] Silke K, Bennett J. Launch Vehicle Reliability Assessment. Reliabil- Commission, 1983. ity Bulletin 92-01, General Dynamics Space Systems Division, 1992. [37] U.S. Nuclear Regulatory Commission (USNRC). Procedural and [18] PateÂ-Cornell ME, Frank MV, Mulvihill RJ, Fragola JR. On the current Submittal Guidance for the Individual Plant Examination of External status of Probabilistic Risk Analysis for the US Space Shuttle, Report Events (IPEEE) for Severe Accident Vulnerabilities, Final Report. to the National Aeronautic and Space Administration, Code Q, Washington, DC: U.S. Nuclear Regulatory Commission, 1991. Washington D.C., February, 2000. [38] U.S. Nuclear Regulatory Commission (USNRC). A Technique For [19] Maggio G. Space Shuttle Probabilistic Risk Assessment: Methodol- Human Error Analysis (Atheana). Washington, DC: Division of ogy and Application. Proceedings of the Annual Reliability and Main- Systems Technology, Of®ce of Nuclear Regulatory Research, 1996. 
tainability Symposium 1996:121±32. [39] Vesely WE. Fault Tree Handbook. Washington, DC: Of®ce of [20] Rutledge P, Weinstock R. Quantitative Risk Assessment System Nuclear Regulatory Research, 1981. (QRAS). Proceedings of the 4th International Conference on Prob- [40] Mosleh A. Procedure For Analysis Of Common-Cause Failures In abilistic Safety Assessment and Management, September 13±18, Probabilistic Safety Analysis. Washington DC: Division of Safety 1998. Issue Resolution, Of®ce of Nuclear Regulatory Research, Nuclear [21] Sa®e FM. An Overview of Quantitative Risk Assessment of Space Regulatory Commission, 1993. Shuttle Propulsion Elements. Proceedings of the 4th International [41] Fragola J.R. Space Shuttle Probabilistic Risk Assessment. Proceed- Conference on Probabilistic Safety Assessment and Management, ings of PSAMIII, Crete, Greece, 1996. September 13±18, 1998. [42] Mosleh A. Personal correspondence describing NASA project work, [22] Frank M. Applications of Technical Risk Assessment in Aerospace. September, 1998. European Space Agency Risk Management Workshop, ESTEC, [43] Mosleh A. Quantitative Risk Assessment System: Software Require- March 30±April 2, 1998:43±66. ment, University of Maryland, CTRS A5-5.1, May, 1998. [23] Baker J. Space Shuttle Range Safety Hazards Analysis. Technical [44] Mosleh A. Quantitative Risk Assessment System: Software Design, Report 81-1329, prepared for NASA, KSC, J. Baker (author), John University of Maryland, CTRS A5-5.2, May, 1998. Wiggins Inc., 1981. [45] Howard RA. Information Value Theory in The Principles and Appli- [24] Rogers W. et al. Report of the Presidential Commission on the Space cations of Decision Analysis. Howard RA, Matheson JE (eds.) Palo Shuttle Challenger Accident, Washington D.C., 1986. Alto, CA: Strategic Decisions Group, 1989. [25] Kaplan S. On the Inclusion of Precursors and Near-Miss Events in [46] Matheson JE. The Economic Value of Analysis and Computation. 
In: Quantitative Risk Assessments: A Bayesian Point of View and a Howard RA, Matheson JE, editors. The Principles and Applications of Space Shuttle Example. Reliability Engineering and System Safety Decision Analysis, Palo Alto, CA: Strategic Decisions Group, 1989. 1990;27:103±15. [47] Helton JC. Treatment of uncertainty in performance assessment for [26] Pinkus RL, Shuman LJ, Hummon NP, Wolfe H. Engineering Ethics: complex systems. Risk Analysis 1994;14:483±511. Balancing Cost, Schedule, and Risk- Lessons Learned from the Space [48] PateÂ-Cornell ME. Uncertainties in risk analysis: Six levels of treat- Shuttle. Cambridge: Cambridge University Press, 1997. ment. Reliability Engineering and System Safety 1996;54:95±111.
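
To illustrate the Bayesian treatment suggested in Section 3.5, where surrogate data serve as a prior that is updated with new flight experience, one possible sketch is a conjugate Beta-binomial update. The counts, the uniform starting prior, the full weight given to surrogate data, and all function names below are illustrative assumptions; none of them is taken from the shuttle PRAs or from QRAS. The posterior mean plays the role of a first-order result, while the credible interval gives a simple view of the second-order uncertainty about that probability.

```python
import random


def beta_update(alpha, beta, failures, trials):
    """Conjugate update of a Beta(alpha, beta) belief with binomial evidence."""
    if not 0 <= failures <= trials:
        raise ValueError("need 0 <= failures <= trials")
    return alpha + failures, beta + (trials - failures)


def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta): the first-order failure probability."""
    return alpha / (alpha + beta)


def beta_interval(alpha, beta, level=0.90, samples=20000, seed=0):
    """Monte Carlo estimate of a central credible interval (second-order view)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(samples))
    lower = draws[int((1 - level) / 2 * samples)]
    upper = draws[int((1 + level) / 2 * samples) - 1]
    return lower, upper


# Hypothetical counts, chosen only for illustration.
# Step 1: uniform Beta(1, 1) prior combined with surrogate data (2 failures in 200 trials).
prior = beta_update(1.0, 1.0, failures=2, trials=200)
# Step 2: update that prior with new flight data (0 failures in 50 flights).
posterior = beta_update(*prior, failures=0, trials=50)

first_order = beta_mean(*posterior)
interval = beta_interval(*posterior)
print(f"mean failure probability per flight: {first_order:.4f}")
print(f"90% credible interval: ({interval[0]:.4f}, {interval[1]:.4f})")
```

In practice an analyst might discount surrogate data before using them as a prior, since they come from a different system; the sketch treats them at full weight only to keep the arithmetic transparent.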