Reliability Engineering and System Safety 74 (2001) 345–352
                                                                                                                       www.elsevier.com/locate/ress


   Probabilistic risk analysis for the NASA space shuttle: a brief history
                               and current work
                                             Elisabeth Paté-Cornell a,*, Robin Dillon b
   a Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, USA
   b Department of Management Science and Information Technology, Pamplin College of Business, Virginia Tech, Falls Church, VA 22043, USA
                                                       Received 2 June 2000; accepted 13 December 2000


Abstract
   While NASA managers have always relied on risk analysis tools for the development and maintenance of space projects, quantitative and
especially probabilistic techniques have been gaining acceptance in recent years. In some cases, the studies have been required, for example,
to launch the Galileo spacecraft with plutonium fuel, but these successful applications have helped to demonstrate the benefits of these tools.
This paper reviews the history of probabilistic risk analysis (PRA) by NASA for the space shuttle program and discusses the status of the on-
going development of the Quantitative Risk Assessment System (QRAS) software that performs PRA. The goal is to have within NASA a
tool that can be used when needed to update previous risk estimates and to assess the benefits of possible upgrades to the system. © 2001
Elsevier Science Ltd. All rights reserved.
Keywords: Probabilistic risk analysis; space shuttle


 * Corresponding author.
   E-mail addresses: mep@leland.stanford.edu (E. Paté-Cornell), dillon@vt.edu (R. Dillon).
0951-8320/01/$ - see front matter © 2001 Elsevier Science Ltd. All rights reserved.
PII: S0951-8320(01)00081-3

1. Introduction

   For a long time, NASA managers have viewed probabilistic risk analysis (PRA) and expected-utility
decision analysis with some suspicion, and many still do. The agency often operates new or improved
systems, and there is seldom an abundance of statistics to describe past performance. Probabilistic
risk analysis, however, is most useful when little statistical data are available to assess the
failure probability of a whole system. In these cases, it is often useful to decompose the system
into subsystems and components to quantify the overall failure risk as a function of the system's
architecture and of the probabilities of failure of the different elements for which more data are
generally available.
   At the onset of the Apollo program, NASA seemed to have accepted the notion that quantitative
risk analysis could be useful for decision support [1]. But the failure probabilities computed for
some missions of the Apollo program were largely overestimated because they were based on
conservative estimates of subsystem failure risks. Because the results were so pessimistic and
showed such a small probability of mission success, NASA at that time, turned away from quantitative
risk assessment methods. A few years later, however, failure probabilities such as 10⁻⁵ or 10⁻⁶ per
flight were often casually quoted without much justification, either for subsystems or for entire
missions, to express and support NASA's confidence in their performances [2].
   Rather than quantifying failure probabilities, the agency has generally preferred qualitative
analyses such as Failure Mode and Effect Analysis (FMEA), Critical Item Lists (CILs), and Risk
Matrices [3–5]. FMEA/CIL relies on the logical identification of a system's weak points and of
failure/event combinations (cut-sets) leading to its catastrophic failure. Risk matrices usually
include, for different components or subsystems, qualitative information and corresponding scale
indices about the likelihood of failure events (e.g., high, medium or low) and the severity of their
consequences (e.g., high, medium, or low). These matrices are often used as filters to decide which
are the highest priority technical problems. A major difficulty when using risk matrices is to
combine such information about the different components to characterize the robustness of the whole
system.
   Since the Challenger accident, however, the use of PRA at NASA has increased significantly, not
only for the space shuttle but also for some unmanned space missions and for the space station
[6–17]. Probabilistic models have been and are being developed to assess the risk contribution of
specific shuttle problems and the benefits of shuttle upgrades. These models are currently
implemented through the software called QRAS (Quantitative Risk Assessment System), itself in its
development phase. Therefore, the current shuttle PRA model is both a motivation and a product of
this development. The goal is to have within NASA a tool that can be used when needed to update
previous risk estimates and to assess the benefits of possible upgrades.
   This paper describes briefly the history of shuttle PRA and the current efforts to develop a
complete risk analysis model at the same time as the QRAS software. It is based in part on studies
such as Fragola [1] and on the results of a recent panel review of current NASA PRA work for the
shuttle [18].


2. PRA for NASA space shuttle: a brief history

2.1. Prior to the Challenger accident

   As mentioned above, at the onset of the Apollo program, NASA generally accepted the notion of
using risk analysis to choose some mission features and to compare the results to safety benchmarks
[1]; but during the Apollo program, pessimistic risk estimates discouraged the agency from adopting
quantitative risk analysis. As is often the case, the problem was that conservative values (as
opposed to means) of future failure frequencies were used to account for uncertainties, instead of a
full uncertainty analysis. In truth, the methods were in their infancy and the software needed did
not exist. The conservative approach seemed prudent but the results were both alarming and
discouraging. As usual when it is clear that they contradict the facts, wrong results became
detrimental to the acceptance of the quantitative risk analysis method itself.
   Furthermore, notions of failure risk and failure probability often clashed (and still do) with
the engineering culture, primarily based on safety factors, which can be defined in many different
ways, and include, for example, design for higher loads than anticipated. Useful as safety factors
may be in decreasing failure risks, the problem is that they are not directly linked to the
probability of system failure and the relationship may vary across systems. For example, a safety
factor of two in one system may not imply the same safety level as in another. Therefore, safety
factors alone are insufficient to support cost-effective allocation of upgrade resources and
prioritization of retrofits across different systems.
   Rather than quantitative risk estimates, shuttle program managers preferred the use of Failure
Modes and Effects Analysis and Critical Items List (FMEA/CIL) to identify a system's weak points and
to manage failure risks accordingly. FMEA is useful to the extent that it indicates which possible
design changes might eliminate a failure mode, reduce its future frequency to a lower level, or
mitigate its consequences. All single-point failures that cannot be eliminated are collected in a
Critical Items List. The process by which this list is established is a static, qualitative,
bottom-up approach, focused on the identification and reduction of the risk of single, independent
component failures that can cause the loss of crew, vehicle, or mission [1,19–22]. In the critical
item list, components are listed according to their level of criticality. Criticality 1
characterizes an element whose failure is sufficient to cause overall shuttle failure defined by
loss of vehicle and crew (LOVC). Criticality 1R applies when there is one redundancy to prevent
LOVC. The problem is that the level of criticality alone is not an indicator of the contribution of
a particular component to the overall failure risk. A Criticality 1 component with a low failure
probability can be less threatening to a system's safety than several components in parallel, each
with a high probability of failure, especially if these failures are highly dependent. Regardless of
its shortcomings, FMEA is a first step towards a PRA and the complete description of all accident
sequences. The next step is the assessment of the probability of system failure per time unit or per
operation as a function of the probabilities of relevant events including component failures,
accounting for dependencies and in particular, those caused by external events.
   In addition to qualitative studies, some quantitative risk analyses had been performed for the
space shuttle before the Challenger accident during flight 51L in January 1986 (see, for example,
Baker [23]). These were small-budget studies, highly constrained and limited by the definition of
their scope and by restrictions on data sources. NASA mandated that in addition to the limited data
available from past performances, the analysts should use as data the opinions of NASA's experts
regarding the performance of specific systems. These experts, for example, estimated the probability
of a Solid Rocket Booster (SRB) failure at 10⁻⁵ per flight without any formal systems analysis
needed to support such an estimate. It is on that basis that early risk analyses for the shuttle
system indicated a probability of LOVC on the order of one in several thousand per flight.
Consequently, this time, they significantly underestimated the failure risk. Yet, the astronauts
knew from their experience with minor failures in flight (e.g., of a particular switch in the crew
cabin) that the risks were probably much higher. Feynman, in his Appendix to the Rogers Commission
report [24], emphasized the problems with such estimates and the downside of overconfidence. Indeed,
Kaplan [25] showed that a Bayesian analysis based on the shuttle 'near-misses' prior to the
Challenger accident could have indicated a much higher failure probability.

2.2. Post-Challenger risk analyses

   The Challenger accident drastically altered this optimism. NASA and its contractors performed a
major review of all shuttle FMEAs and updated the Critical Item List. The immediate result was a
large increase in the number of
critical items from 2,369 to 4,686 [26]. In addition, the Rogers Commission [24] that investigated
the accident concluded that the perceptions of shuttle failure risks had been overly optimistic and
that the risk assessment methods needed improvement. Furthermore, priorities needed to be set among
risk mitigation measures given the limitations of NASA's resources. Qualitative methods that had
provided general guidance for risk management were inadequate for prioritization because they did
not allow quantification of the relative contributions of the different components to the
probabilities of system failure.
   At about the same time, a National Research Council (NRC) panel reviewed the risk assessment and
management of the space shuttle program, and recommended quantitative approaches to set priorities
among possible upgrades of critical items. The NRC panel found that previous quantifications of the
shuttle risks were based almost exclusively on subjective judgments and qualitative rationales, even
though quantitative engineering analyses and test data relevant to risk assessment were available
and could have been used [27].
   In the late 1980s and early 1990s, probabilistic risk analysis (PRA) therefore seemed a better
alternative to qualitative risk assessment. Yet, within NASA, there was still strong resistance.
First, the cost of a complete PRA seemed high. The value of information as decision support was not
well understood, and it was sometimes stated that instead, the same amount of money could be better
invested in strengthening the system. The question of course is: where should that investment be
made in priority, and how much will eventually be gained by replacing intuition with quantitative
decision support? Second, the use of Bayesian probability, which is often the only option given the
systems' novelty, was often considered at NASA too 'subjective' to be trusted for decision support.
That was true until it became obvious that there was no better alternative because by definition,
extensive data sets did not exist. At the very least, these methods allowed a systematic and
consistent assessment and treatment of the risk components.
   In the following years (1990s), a number of pilot studies and a first attempt at a comprehensive
shuttle PRA study were undertaken. In an initial attempt to incorporate probabilistic risk analysis
methods in its decision support, NASA commissioned two 'proof-of-concept' studies. Their objective
was to determine if a PRA could identify high-risk areas that traditional FMEA/CIL and hazard
analysis techniques could not. One of these studies focused on the auxiliary power units (APUs), and
the other on the main propulsion pressurization system [28,29]. These two studies showed that the
probabilities of failure of a small number of CIL items represented most of the shuttle failure
risk, and that in addition, several important failure scenarios had not been identified by NASA's
previous analyses. Furthermore, the APU study demonstrated that the number of redundancies had to be
weighed against the increase of risk of fire and explosion caused by the possibility of a hydrazine
leakage (for the reference to the original study, see Fragola [1]). These studies also showed that
components of criticality higher than 1 (i.e., less critical to LOVC) could contribute as much as
30% of the overall failure risk. This implies that limiting a PRA to Criticality 1 items is likely
to lead to an underestimation of the risk.
   Shortly after the completion of these proof-of-concept studies, NASA was required to fund an
independent shuttle PRA to support the approval of the launch of the Galileo mission because
plutonium fuel was present on the spacecraft [30]. This study was limited to the ascent portion of
the mission and focused primarily on scenarios that presented a risk to the nuclear payload. It
included a propagation of uncertainties about the future frequencies of component failures, thus
providing a probability distribution for the future frequency of a shuttle accident or flight abort.
Despite NASA's conclusion that the probability of shuttle failure could be high (median estimate
1/78), this study concluded that the risk to the public caused by plutonium contamination was low.
   Prior to approval of the launch of the Ulysses spacecraft from the shuttle, the shuttle risk
analysis that was required for the Galileo mission needed updating to consider any variations
associated with this spacecraft since the previous study. In 1993, NASA thus commissioned an update
of the Galileo study, using Bayesian techniques to integrate the former risk estimates with the new
evidence that had been gathered since the original report [31].
   Around the same time (the early 1990s), damage to several of the tiles of the shuttle heat shield
during previous missions prompted NASA's management to commission a study of the thermal protection
system (TPS). This study showed that the contribution of the black tiles that protect the underside
of the shuttle orbiter at re-entry was about 10% of the overall shuttle failure risk [32]. Only two
tiles had failed in flight thus far, without causing damage to the orbiter's skin. One failed
because of a weak bond, and the other because it was hit by a piece of debris probably coming from
the insulation of the external tank. That study showed that 15% of the tiles were the source of 85%
of the probability of a shuttle accident induced by TPS failure. More importantly, on the basis of a
simple first-order risk analysis, it allowed ranking the tiles by order of risk contribution, and
therefore, setting priorities in the tile inspection before each flight [32,33].
   Then, in 1995, NASA funded the first attempt at a comprehensive quantitative risk assessment
including all phases of a shuttle mission [34]. The method used was similar to the PRA framework
developed by the US Nuclear Regulatory Commission [35–40] (Master Logic Diagram, fault tree and
event tree analyses, etc.) to obtain the probability of a major accident as a function of the
probabilities of component and subsystem failures [41]. Because of resource limitations, however, a
number of components were assumed to contribute negligible additional risks and were not included in
the analysis. In addition, some external
accident initiators were left out, for example, the penetration by micrometeoroids of systems other
than the tiles. The results showed a probability of a shuttle accident (LOVC) between 1/76 and
1/230, which seemed consistent with the limited experience available at that time. Following that
study, NASA managers (as the Nuclear Regulatory Commission had done a few years before) decided to
use PRA as one of the bases for the support of decisions regarding improvements in shuttle safety.
They needed a tool to routinely perform shuttle PRAs, which had to be updated regularly to monitor
risk variations and to evaluate the effects of changes in design and operation procedures. This is
the effort that is now underway and which we describe further.
   Many lessons were already learned in these early PRAs; for example:

1. Conservative estimates should not be mixed with probabilities that represent (for instance) mean
   future frequencies of failures. Otherwise the results are meaningless and possibly
   counterproductive.
2. Guessing the probability of failure of a complex system such as the SRBs is unlikely to lead to
   an accurate figure when the system can be analyzed to provide a better result.
3. Near-miss events and partial failures can provide valuable information for the assessment of
   system failure risk, especially when a catastrophic failure has not yet happened.
4. Restricting a PRA to Criticality 1 items is likely to lead to an underestimation of the failure
   risks.
5. Adding redundancies does not always improve the safety of the system (APUs, for example,
   introduce an added risk of hydrazine leakage that has to be weighed against the value of an extra
   redundancy).
6. A top-down analysis is needed to capture the dependencies among system failures, for example
   between the debonding of debris from the insulation of the external tank and their effects on the
   tiles of the heat shield.


2.3. The current work on shuttle PRA and the QRAS software

   The studies mentioned above were all completed by independent consultants outside of NASA. In
July 1996, the NASA administrator requested that an independent quantitative analysis of the risk of
a shuttle accident be conducted by internal NASA experts, and that supporting software be developed.
The long-term objective is to use the results as decision support for shuttle upgrades. The chosen
approach is to develop the Quantitative Risk Assessment System (QRAS) software to perform PRA,
permit its updating, and allow real-time support of decisions ranging from retrofit to launch under
specified circumstances. This ongoing study involves, in parallel, the improvement of the first
version of the QRAS software and the performance of a PRA. It will result in a model that will
provide an overall shuttle failure probability and will allow estimation of the risk changes
associated with proposed shuttle upgrades (e.g., an upgrade of the main engine turbopumps).
   Two separate teams are currently developing these risk analysis models. The first one, at Johnson
Space Center (JSC), analyzes the orbiter and its main propulsion systems including the auxiliary
power systems, hydraulic system, thrust vector control, and main propulsion system. The second team,
at Marshall Space Flight Center (MSFC), is in charge of the other shuttle subsystems: the main
engines, the external tank, the solid rocket boosters, and the reusable solid rocket motors. These
studies are designed to be limited to Criticality 1 and 1R items, and generally assume that failures
of such items inevitably lead to an accident or mission failure. For analytical purposes, the
system, as well as the PRA, has been divided into 'modules'. The analysis is being done 'bottom-up'
on the basis of these modules. Some links have been included to ensure that accident sequences that
cut across modules, and across analytical teams, are accounted for. Yet, the current exercise is
facing some of the classical difficulties of coordinating a PRA study when the system has been
divided for analytical purposes. Both teams are supposed to rely on QRAS while it is still in its
development phase. Therefore, at this stage, some elements of the PRA (e.g., fault tree analysis
results, especially for the analysis of the orbiter) are computed 'off-line' independently from the
existing software. A panel of experts recently reviewed the current shuttle PRA efforts [18]. Some
of the comments of this panel are described below.
   The computer software QRAS originated at NASA Headquarters in conjunction with the University of
Maryland in 1998 [42–44]. It is currently being developed by NASA Headquarters and its contractors
and subcontractors, including Allied Signal and L&M Technology. QRAS aggregates subsystem failure
mode probabilities from the bottom up to produce intermediate and top-level catastrophic failure
probabilities and bounds on the uncertainties. It is based on the identification of a set of
scenarios represented by event sequence diagrams (ESDs), starting with an initiating event and
ending with an accident, a flight abort, or a benign outcome, either directly or through a sequence
of intermediate ('pivotal') events. Among the results is a prioritization of the subsystem failure
modes that contribute most to the overall risk, and an evaluation (and ranking) of space shuttle
potential upgrades, both from a safety and a cost point of view [20]. The first version of this
software is currently used at JSC to assess the failure risks of the shuttle orbiter, and at MSFC to
assess the failure risks of the other shuttle subsystems.
   Once the QRAS software is completed and available, NASA will be able to develop and upgrade PRA
models at very detailed levels, integrating physical models of failure processes into the logic
model and the probabilistic analysis.
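The bottom-up aggregation over event sequence diagrams described for QRAS can be illustrated with a small sketch. It is not QRAS code: the `Scenario` class, the `aggregate` function, the initiator names and every probability below are hypothetical, chosen only to show how scenario probabilities might be rolled up by end state and how initiating events might be ranked by their contribution to the accident probability.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import prod
from typing import Tuple

# End states named in the text: accident, flight abort, or benign outcome.
ACCIDENT, ABORT, BENIGN = "accident", "abort", "benign"


@dataclass(frozen=True)
class Scenario:
    """One path through a hypothetical event sequence diagram (ESD)."""
    initiator: str                 # name of the initiating event
    p_initiator: float             # probability of the initiator per flight
    pivotal: Tuple[float, ...]     # conditional probabilities of the branches taken
    end_state: str                 # ACCIDENT, ABORT or BENIGN

    def probability(self) -> float:
        # Initiator probability times the conditional branch probabilities.
        return self.p_initiator * prod(self.pivotal)


def aggregate(scenarios):
    """Sum scenario probabilities by end state; rank initiators by accident share."""
    by_end_state = defaultdict(float)
    accident_by_initiator = defaultdict(float)
    for s in scenarios:
        p = s.probability()
        by_end_state[s.end_state] += p
        if s.end_state == ACCIDENT:
            accident_by_initiator[s.initiator] += p
    ranking = sorted(accident_by_initiator.items(), key=lambda kv: kv[1], reverse=True)
    return dict(by_end_state), ranking


# Illustrative numbers only; they are not taken from any shuttle study.
EXAMPLE = [
    Scenario("hydrazine leak", 1e-3, (0.1,), ACCIDENT),          # leak ignites
    Scenario("hydrazine leak", 1e-3, (0.9,), BENIGN),            # leak does not ignite
    Scenario("turbopump failure", 2e-3, (0.5,), ABORT),          # safe shutdown
    Scenario("turbopump failure", 2e-3, (0.5, 0.8), ABORT),      # shutdown fails, damage contained
    Scenario("turbopump failure", 2e-3, (0.5, 0.2), ACCIDENT),   # shutdown fails, damage uncontained
]

totals, ranking = aggregate(EXAMPLE)
print(totals)
print(ranking)
```

Summing scenario probabilities in this way assumes that the scenarios are mutually exclusive and that initiators are rare; the sketch makes no attempt to model common events or common-cause failures shared between sequences.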
For the moment, however, the two space centers that are                    Challenger accident, the Rogers commission [24] showed
charged of the development of PRA models using the ®rst                    that poor communications were at the source of the fatal
version of the QRAS software have had mixed experiences                    decision to launch on that day. In the same way, during
with it because it still misses important features. For exam-              the earlier analysis of the tiles contribution to failure risks,
ple, in its current version, QRAS treats accident sequences                it was shown that part of the risk of tile failure could be
independently from others even though some may have                        attributed to debris hits caused by the debonding of parts of
common events. It does not include proper treatment of                     the insulation of the external tank [32]. Yet, the two systems
external events and common causes of failures, and it                      are managed independently, the external tank at Marshal
does not have the capability of building and analyzing                     Space Flight Center and the tiles at Kennedy Space Center,
fault trees. Therefore, some of the current PRA work has                   and it took the chemical analysis of a missing tiles cavity
to be done off-line, for instance using fault tree analyses that           before the link was established.
are not part of the current software, before integrating the
results in the QRAS models. A consortium of industry                       3.2. Analytical modules as opposed to an overarching model
contractors, the United Space Alliance (USA), has been
                                                                              The role of an overarching model is to ensure the comple-
charged with the space shuttle operations and is monitoring
the shuttle PRA work done both at JSC and MSFC. Its objective is to use the results to support recommendations for upgrades of the shuttle design as well as improvements of maintenance and processing operations.

3. Some characteristics of the current PRA modeling efforts for the space shuttle

In a recent review of the space shuttle PRA, a panel of experts [18] concluded first and foremost that the PRA models currently developed by NASA were an important step towards improving the risk management process. It is essential at this stage that NASA adopt current risk analysis methods to be able to improve its systems in a cost-effective way. Yet, it was also found that the current models exhibited a number of characteristics that left room for improvement.

teness and the accuracy of a PRA, the inclusion of dependencies across systems and of common events across accident sequences, and the proper treatment of external events that can simultaneously affect several subsystems. This requires a top-down approach starting from a systematic analysis of accident sequences, or conjunctions of events leading to failure. An overarching model can be based on different tools such as the Master Logic Diagram developed and used in the nuclear power industry, a complete event tree, or an influence diagram. Influence diagrams are particularly helpful because they can process both probabilistic dependencies and deterministic functions such as the Boolean analysis involved in fault trees. They also provide a graphical display of interdependencies among events. What is important, in any case, is less the nature of the tool itself than the completeness of the set of scenarios and the analysis of dependencies that are included in the PRA.
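As a minimal numerical sketch of why common events across accident sequences matter, consider two redundant subsystems that can each fail from their own causes or from one shared event. The Python fragment below is only an illustration under stated assumptions: the probabilities are invented placeholders (not shuttle data), and the function names are hypothetical. It compares a module-by-module estimate, which multiplies marginal failure probabilities as if the modules were independent, with an estimate that conditions on the shared event.

```python
# All probabilities are made-up illustration values, not shuttle data.
P_SHARED = 0.01   # hypothetical common event affecting both subsystems
P_A_OWN = 0.02    # hypothetical failure of subsystem A from its own causes
P_B_OWN = 0.03    # hypothetical failure of subsystem B from its own causes


def marginal_failure(p_own: float, p_shared: float) -> float:
    """P(subsystem fails) when it fails on its own cause OR on the shared event."""
    return 1.0 - (1.0 - p_own) * (1.0 - p_shared)


def pair_failure_independent(p_a: float, p_b: float) -> float:
    """Redundant pair (fails only if both fail), treated as independent modules."""
    return p_a * p_b


def pair_failure_conditioned(p_a_own: float, p_b_own: float, p_shared: float) -> float:
    """Same pair, using the law of total probability over the shared event."""
    # If the shared event occurs, both subsystems fail together.
    # Otherwise each fails independently from its own causes.
    return p_shared * 1.0 + (1.0 - p_shared) * p_a_own * p_b_own


p_a = marginal_failure(P_A_OWN, P_SHARED)
p_b = marginal_failure(P_B_OWN, P_SHARED)
naive = pair_failure_independent(p_a, p_b)
exact = pair_failure_conditioned(P_A_OWN, P_B_OWN, P_SHARED)

print(f"module-by-module estimate:     {naive:.6f}")
print(f"with the shared event modeled: {exact:.6f}")
```

With these placeholder values, the conditioned estimate is roughly nine times the module-by-module one, because the shared event dominates the failure of the redundant pair.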
3.1. The effect of organizational dispersion

The coordination and communication among the teams that perform the shuttle PRAs at JSC and MSFC may not be sufficient. For example, the two groups use different 'ground rules' and assumptions, possibly because they interpreted NASA's initial directions differently. The studies were to be limited to the most significant of Criticality 1 items. In addition, a common assumption was that failure of these items inevitably leads to a system failure. The first question is to choose the items to be included in the analysis, and the two teams adopted different procedures to choose the events included in their models, the level of detail of their studies, and the treatment of quantitative data. Therefore, the results obtained in the two centers are not directly comparable at this stage. In addition, when a system is divided at the onset of a PRA without an overarching model to ensure completeness and consistency, issues can surface in the treatment of dependencies across subsystems, common causes of failures and performance of the interfaces.

The problem of dispersion of work across centers with insufficient communications is a common one that had already been identified in the past as one of the safety problems of the shuttle system. For example, after the

The PRA models for the different parts of the shuttle are currently constructed mostly bottom-up. The system has been divided into modules that are then analyzed. This structure has no doubt facilitated the division of work, and some accident sequences that cut across modules have been included. The decomposition of the system, however, is generally one of the steps of the analysis, based on logic and, if resources are constrained, on the value of information of further decomposition. The definition of modules as a starting point in the analysis can lead to missing failure dependencies and commonality of elements among accident sequences. Therefore, it can hide the true risk contribution of an element that affects several modules if no integration mechanism permits assessing the role of this component across the system.

3.3. A simplified approach to consistency in the level of analytical depth and detail

It is generally impossible to include all components and all event scenarios in a PRA, and an adapted screening procedure is necessary. This screening procedure is meant to filter out the scenarios that are low contributors to the overall risk while retaining the important ones. For simplicity, the current PRAs for the space shuttle are limited to
failure scenarios involving Criticality 1 and 1R items, and among these scenarios, a decision is made a priori as to which ones are sufficient risk contributors to be included in the analysis.

As mentioned above, the criticality level is only loosely coupled to an item's contribution to the probability of failure, and it was shown for the shuttle that items of Criticality 2 and above were significant risk contributors. Therefore, the simplicity of this choice probably leads to excluding some risky items that are possibly more dangerous than some that were included. More importantly, perhaps, it might eliminate from the ranking of upgrades some improvements that could be more cost-effective than those considered. This choice, by itself, would lead to an underestimation of the overall probability of failure.

The definition of Criticality 1 items does not imply that their failure inevitably leads to system failure, only that it can cause an accident. Several intermediate ('pivotal') events can occur following such a failure. Some sequences (or conjunctions) can lead to an accident, others to a safe flight abort or to a correction by human intervention that permits completion of the mission. Yet, in order perhaps to produce conservative results or to balance the exclusion of other components, it is generally assumed in the current studies that the occurrence of an initiating event of Criticality 1 inevitably leads to an accident. Therefore, this time, the simplifying assumption may lead to an overestimation of the consequences of a Criticality 1 event. It may be that the choice of Criticality 1 items (and only of some of them) compensates for the assumption that they lead to system failure; but it is impossible to tell without further information whether the overall results reflect an overestimation or underestimation of the failure risk.

The definition of initiating events is only a starting point. The choice of analytical depth and of an adequate level of detail in the different parts of a system is critical to ensure first, the best use of the resources spent for the analysis, and second, the consistency of results across subsystems. The analytical depth in the current shuttle PRAs is simply determined by the hierarchy of components and subsystems. A certain form of consistency has thus been obtained.

Alternatively, consistency in analytical depth could be based on the value of the additional information that one might expect from pursuing the analysis further down in some parts of the system. Therefore, another rule could be to stop the analysis when it does not bring additional information that is likely to make a difference in the results and in the decisions that they support [45,46]. When there are sufficient failure data at a subsystem level, it does not need to be analyzed further, unless one seeks to evaluate the contribution of one of its components to the overall failure probability. By contrast, when there is little statistical evidence about a subsystem and when data are available at the component level only, the analysis has to be done at the component level. Sometimes, it may even be necessary to go further and decompose the failure of a component into its failure modes (for example, a valve can be stuck open or closed). One can then take advantage of the powers of Bayesian treatment of the evidence and of systems analysis to aggregate the risk contribution of the different elements.

Analytical judgment and flexibility are thus required to ensure that the main risk contributors are identified and included in the analysis, down to the same level of risk contribution. This choice may or may not correspond to the classical hierarchy of subsystems and components. A simplified uniform approach to analytical depth can be ineffective because a detailed analysis at the component level can be useful in some places but unnecessary in others.

3.4. Human decisions and action in a risk analysis

Human decisions and actions are key factors in system failure risks. Yet, they are sometimes ignored or poorly treated. It is important to note that they can include not only catastrophic errors, but also operators' actions to correct a dangerous condition.

Errors can occur in manufacturing, system assembly, inspection and maintenance, and operation (mission). When these errors are already included in the database used for risk estimation, they are de facto included in the analysis and do not need to be addressed further. Existing statistical data, however, may not include rare errors that can have catastrophic consequences. It seems that in the PRAs that are currently performed by NASA, there is no analysis of process errors that can affect the different subsystems. Errors of this type, for instance, were analyzed and included in the 1993 study (mentioned above) of the LOVC risks due to failure of the tiles [32,33]. These errors included, for example, failure to center a tile in its cavity during maintenance operations, or letting the bond dry before applying pressure. Both can significantly reduce the strength of the bond, causing a tile to debond in operation and leaving the aluminum skin exposed to heat loads at re-entry.

Errors can also affect the operations phase. Yet, there seems to be an implicit assumption in the NASA studies that astronauts make no mistakes (with the possible exception of an error at landing). Clearly, this apparent omission of human errors tends to lead to an underestimation of the risk of catastrophic failures.

But there is a positive side: human intervention can reduce the risks of an accident or stop the propagation of an accident sequence. Again, it may be that skilled interventions compensate for the possibility of human error, but it is impossible to determine without further information the net effect of these two omissions on the overall failure risk.

3.5. Mixed methods in data analysis

A risk analysis is most useful when there are few statistical data, of different natures and from different sources, so that the situation requires Bayesian treatment of the evidence. The frequentist approach to classical statistics
has the advantage of being commonly used but requires a large amount of information. Furthermore, the confidence levels that qualify the results are difficult to interpret as characteristics of uncertainties. The Bayesian analysis is more powerful in this respect and does not require a specified amount of data (i.e., the quantity of information is reflected in the results of the uncertainty analysis). But it requires the use of prior probabilities (e.g., uniform distributions may be used to reflect complete ignorance when that is truly the case), which injects an element of subjectivity in the analysis. In the case of space systems, one generally does not have the choice because flight data are rare by definition. Yet many different types of data can be used as input to a risk study.

The current PRAs for the space shuttle often use both frequentist and Bayesian analyses (hence possible inconsistencies), but not always all available information. Possible data include test data, flight data, surrogate data and expert opinions when appropriate. Therefore, the results could probably be improved by consistently adopting a Bayesian approach, using all existing data. For example, surrogate data can be used as priors to be updated based on additional experience and new flight data. In any case, failure probabilities must be assessed differently if they represent marginal or conditional probabilities, in which case the events on which they depend must be considered.

Finally, in the current PRAs, the simplifying choice was made to compute first-order probabilities only. The results are thus represented by the probability (or mean future frequency) of different potential system states, based on the probabilities of different hypotheses or models, and of parameter values given these different models [47,48]. In contrast, in a second-order uncertainty analysis, the uncertainties about the possible underlying hypotheses or models are propagated throughout the analysis. The results are probability distributions of the probabilities (or future frequencies) of different system and subsystem states. A first-order probability analysis is sufficient to set priorities when the ranking criterion is the mean future frequency. Yet, an assessment of uncertainties in the input (i.e., failure frequency distributions for the basic components) and consistent propagation of these uncertainties in the analysis could permit, in addition, an assessment of the effects of uncertainties on the results and on priorities.

Many of these simplifying assumptions will be unnecessary after the completion of the QRAS software. QRAS is currently being updated to eventually involve features such as fault tree analysis, an overarching model, external events, human errors and adequate Bayesian treatment of all available information. The current experience is probably a necessary step towards the realization that such features (among others) are needed to provide results that are credible in absolute terms and that, in relative terms, permit ranking of upgrades by order of cost-effectiveness.

NASA's experience in this respect is not unique. The US Nuclear Regulatory Commission has taken a long time to develop its PRA models to the point where they can be used, along with other types of information, to make safety decisions. Given the unique nature of its systems, NASA will probably need to go through a similar exercise, and the use of quantitative risk analysis will be of great value in assisting decisions in all phases of space systems life, from design to processing, operations and upgrading. It is important, however, that fundamental issues be recognized and resolved quickly.

4. Conclusions

The Probabilistic Risk Analysis method has gone through ups and downs at NASA. From the hopes of the early times of the Apollo program, to the disappointments of pessimistic then optimistic results (which were wrong in both cases), it is slowly being improved and incorporated in the NASA thinking about risk ranking and prioritization of upgrades. Where it is resisted, it is often because it runs against the engineering tradition of safety factors and because of suspicion about the use of Bayesian probability. Yet, if one wants to assess the risk, the Bayesian approach is unavoidable because there are seldom enough data for a classical statistical analysis. PRA has now been adopted as one of the decision supports for the management of the space shuttle, of the space station and of some unmanned space missions. In the long term, this decision will improve the consistency and the efficiency of the management of NASA's space systems. As usual, in this early phase of the PRA modeling, several problems still need to be addressed. Some of them are essentially organizational (i.e., the work is divided among several space centers). But the most important issues are the need for an overarching model and fundamental consistency in the choice of method of problem structure, analytical depth and treatment of data. As always, the value of the analysis will be determined by the use of the information that it provides, and NASA should realize an improvement in decision-making based on quantitative assessment rather than intuition and guesses at the system level.

References

[1] Fragola JR. Risk Management in US Manned Spacecraft: From Apollo to Alpha and Beyond. Proceedings of ESA Product Assurance Symposium and Software Product Assurance Workshop, Noordwijk, Netherlands, March 19–22, 1996.
[2] Feynman R. Personal Observations on the Reliability of the Shuttle, Appendix IIF. In: Rogers, et al., 1986.
[3] Bowles JB. The New SAE FMECA Standard. Proceedings of the Annual Reliability and Maintainability Symposium 1998:48–53.
[4] Littlefield ML. FMEA/CIL Implementation for the Space Shuttle New Turbopumps. Proceedings of the Annual Reliability and Maintainability Symposium 1996:48–52.
[5] Onodera K. Effective Techniques of FMEA at Each Life-Cycle Stage. Proceedings of the Annual Reliability and Maintainability Symposium 1997:50–6.
[6] Agarwala AS. Reliability Engineering in Defense and Aerospace – A
     Transition to the Commercial World. Communications in Reliability, Maintainability, and Supportability 1994;1(1):14–9.
[7] Davison M, Vantine WL. Understanding Risk Management: A Review of the Literature and Industry Practice. European Space Agency Risk Management Workshop, ESTEC, March 30–April 2, 1998:253–6.
[8] Frank M. A Survey of Risk Assessment Methods from the Nuclear, Chemical, and Aerospace Industries for Applicability to the Privatized Vitrification of Hanford Tank Wastes. Report to the Nuclear Regulatory Commission, August, 1998.
[9] Frank M. Assessment of the Cassini Mission Nuclear Risk with Aleatory and Epistemic Uncertainties. Proceedings of the 4th International Conference on Probabilistic Safety Assessment and Management. September 13–18, 1998.
[10] Frank M. Personal correspondence describing NASA project work. October, 1998.
[11] Guarro S, Bream B, Rudolph LK, Mulvihill RJ. The Cassini mission risk assessment framework and application techniques. Reliability Engineering and System Safety 1995;49:293–302.
[12] Jet Propulsion Laboratory (JPL). Cassini Recertification Review, JPL Internal Document D-11715, 2. Pasadena, California: Jet Propulsion Laboratory, 1994.
[13] Miles R. Personal correspondence describing NASA project work, July, 1998.
[14] Mulvihill RJ. Personal correspondence describing NASA project work, July, 1999.
[15] Railsback J. Personal correspondence describing NASA project work, July, 1998.
[16] Shemanski T, Silke K. Reliability Growth Model Overview. Reliability Bulletin 92-02, General Dynamics Space Systems Division, 1992.
[17] Silke K, Bennett J. Launch Vehicle Reliability Assessment. Reliability Bulletin 92-01, General Dynamics Space Systems Division, 1992.
[18] Paté-Cornell ME, Frank MV, Mulvihill RJ, Fragola JR. On the current status of Probabilistic Risk Analysis for the US Space Shuttle, Report to the National Aeronautic and Space Administration, Code Q, Washington D.C., February, 2000.
[19] Maggio G. Space Shuttle Probabilistic Risk Assessment: Methodology and Application. Proceedings of the Annual Reliability and Maintainability Symposium 1996:121–32.
[20] Rutledge P, Weinstock R. Quantitative Risk Assessment System (QRAS). Proceedings of the 4th International Conference on Probabilistic Safety Assessment and Management, September 13–18, 1998.
[21] Safie FM. An Overview of Quantitative Risk Assessment of Space Shuttle Propulsion Elements. Proceedings of the 4th International Conference on Probabilistic Safety Assessment and Management, September 13–18, 1998.
[22] Frank M. Applications of Technical Risk Assessment in Aerospace. European Space Agency Risk Management Workshop, ESTEC, March 30–April 2, 1998:43–66.
[23] Baker J. Space Shuttle Range Safety Hazards Analysis. Technical Report 81-1329, prepared for NASA, KSC, J. Baker (author), John Wiggins Inc., 1981.
[24] Rogers W. et al. Report of the Presidential Commission on the Space Shuttle Challenger Accident, Washington D.C., 1986.
[25] Kaplan S. On the Inclusion of Precursors and Near-Miss Events in Quantitative Risk Assessments: A Bayesian Point of View and a Space Shuttle Example. Reliability Engineering and System Safety 1990;27:103–15.
[26] Pinkus RL, Shuman LJ, Hummon NP, Wolfe H. Engineering Ethics: Balancing Cost, Schedule, and Risk - Lessons Learned from the Space Shuttle. Cambridge: Cambridge University Press, 1997.
[27] National Research Council (NRC). Post-Challenger Evaluation of Space Shuttle Risk Assessment and Management. Committee on Shuttle Criticality Review and Hazard Analysis Audit of the Aeronautics and Space Engineering Board, National Academy of Sciences, National Research Council, National Academy Press, Washington, DC, January, 1988.
[28] Slay, et al. Space Shuttle Risk Assessment Proof-of-Concept Study, Auxiliary Power Unit and Hydraulic Power Unit Analysis Report. McDonnell Douglas Corp., December 18, 1987.
[29] Plistiras J., et al. Space Shuttle Main Propulsion Pressurization System Probabilistic Risk Assessment, Final Report. Lockheed Corporation, Palo Alto, CA, 1988.
[30] Buchbinder B. Independent Assessment of Shuttle Accident Scenario Probabilities for the Galileo Mission, Volume 1. NASA/HQ Code QS, Washington DC, 20546, April, 1989.
[31] SAIC. Probabilistic Risk Assessment of the Space Shuttle Phase 1: Space Shuttle Catastrophic Failure Frequency Final Report, 1993.
[32] Paté-Cornell ME, Fischbeck PS. Probabilistic risk analysis and risk-based priority scale for the tiles of the space shuttle. Reliability Engineering and System Safety 1993;41:221–38.
[33] Paté-Cornell ME, Fischbeck PS. PRA as a management tool: organizational factors and risk-based priorities for the maintenance of the tiles of the space shuttle orbiter. Reliability Engineering and System Safety 1993;41:239–57.
[34] SAIC. Probabilistic Risk Assessment of the Space Shuttle, 1995.
[35] U.S. Nuclear Regulatory Commission (USNRC). Reactor Safety Study: Assessment of Accident Risk in U.S. Commercial Nuclear Plants, WASH-1400 (NUREG-75/014). Washington, DC: U.S. Nuclear Regulatory Commission, 1975.
[36] U.S. Nuclear Regulatory Commission (USNRC). PRA Procedures Guide, NUREG/CR-2300. Washington DC: U.S. Nuclear Regulatory Commission, 1983.
[37] U.S. Nuclear Regulatory Commission (USNRC). Procedural and Submittal Guidance for the Individual Plant Examination of External Events (IPEEE) for Severe Accident Vulnerabilities, Final Report. Washington, DC: U.S. Nuclear Regulatory Commission, 1991.
[38] U.S. Nuclear Regulatory Commission (USNRC). A Technique For Human Error Analysis (Atheana). Washington, DC: Division of Systems Technology, Office of Nuclear Regulatory Research, 1996.
[39] Vesely WE. Fault Tree Handbook. Washington, DC: Office of Nuclear Regulatory Research, 1981.
[40] Mosleh A. Procedure For Analysis Of Common-Cause Failures In Probabilistic Safety Analysis. Washington DC: Division of Safety Issue Resolution, Office of Nuclear Regulatory Research, Nuclear Regulatory Commission, 1993.
[41] Fragola J.R. Space Shuttle Probabilistic Risk Assessment. Proceedings of PSAMIII, Crete, Greece, 1996.
[42] Mosleh A. Personal correspondence describing NASA project work, September, 1998.
[43] Mosleh A. Quantitative Risk Assessment System: Software Requirement, University of Maryland, CTRS A5-5.1, May, 1998.
[44] Mosleh A. Quantitative Risk Assessment System: Software Design, University of Maryland, CTRS A5-5.2, May, 1998.
[45] Howard RA. Information Value Theory in The Principles and Applications of Decision Analysis. Howard RA, Matheson JE (eds.) Palo Alto, CA: Strategic Decisions Group, 1989.
[46] Matheson JE. The Economic Value of Analysis and Computation. In: Howard RA, Matheson JE, editors. The Principles and Applications of Decision Analysis, Palo Alto, CA: Strategic Decisions Group, 1989.
[47] Helton JC. Treatment of uncertainty in performance assessment for complex systems. Risk Analysis 1994;14:483–511.
[48] Paté-Cornell ME. Uncertainties in risk analysis: Six levels of treatment. Reliability Engineering and System Safety 1996;54:95–111.