Optimal Mission Abort Policy for Systems Operating in a Random Environment
Abstract
Many real-world critical systems, e.g., aircraft, manned space flight systems, and submarines, utilize mission aborts to enhance their survivability. Specifically, a mission can be aborted when a certain malfunction condition is met, after which a rescue or recovery procedure is initiated. For systems exposed to external impacts, the malfunctions are often caused by the consequences of these impacts. Traditional system reliability models typically cannot address the possibility of mission aborts. Therefore, in this article, we first develop the corresponding methodology for modeling and evaluating the mission success probability and survivability of systems experiencing both internal failures and external shocks. We consider a policy under which a mission is aborted and a rescue procedure is activated upon occurrence of the mth shock. We demonstrate the tradeoff between the system survivability and the mission success probability, which should be balanced by the proper choice of the decision variable m. A detailed illustrative example of a mission performed by an unmanned aerial vehicle is presented.
1. INTRODUCTION
Most existing system reliability models mainly deal with assessing the probability of performing a required function by a system under the given operational conditions and for a specified period of time.1 Another conventional index is the mission success probability, i.e., the probability of completing a specific mission with or without a deadline.2 However, in practice, there often exist situations when survival of a system, due to safety or cost-related reasons, may have a higher priority than accomplishing the defined mission. In these cases, a mission abort policy can be implemented in order to improve the system survivability and thus to decrease the risk of casualties and/or of substantial economic losses.
A mission is usually aborted when a certain malfunction or incident condition (e.g., external impacts) is met, after which a safe rescue or recovery procedure is initiated.3 A real-world example of the described scenario is an aircraft that can be required to abort a mission after a certain number of external impacts associated with malicious activity or natural conditions (e.g., lightning inducing electrical peaks in the electrical circuits). These impacts can cause deterioration of critical systems that makes the risk associated with continuing the mission unacceptable.
Traditional system reliability models typically cannot address the possibility of mission aborts while evaluating and optimizing reliability characteristics of engineering systems. In this article, we make a novel contribution by modeling and evaluating the mission success probability and survivability of systems operating in a random environment and subject to mission aborts. The impact of the environment is modeled by an external shock process. In our model, shocks affect the system failure rate directly, increasing it by a constant amount with each event. We consider a policy under which a mission is aborted and a rescue procedure is activated immediately after the mth shock.
Reliability analysis of systems with mission abort policies is a rather new and practically important topic addressed only in a couple of papers so far. In the pioneering paper by Myers,4 the author considered standby systems with an abort policy and a rescue procedure to be initiated upon the failure of a fixed number of components. The corresponding method was developed only for homogeneous, hot standby systems with identical components and exponential time-to-failure distributions. In Levitin et al.,5 the model was extended to heterogeneous systems and the adaptive abort policy. However, these papers do not take into consideration the influence of a stochastic environment on operational characteristics of systems and the corresponding abort policy. Neglecting the effect of a random environment and considering only static models can lead to substantial discrepancies in assessing reliability and safety characteristics of various engineering systems.
There is an extensive literature on shock modeling in reliability and risk analysis (see, e.g., the following monographs: Nakagawa,6 Finkelstein,7 Finkelstein and Cha8). Traditionally, one distinguishes between two major types of shock models: the cumulative shock models, in which systems fail due to some cumulative effect, and the extreme shock models, in which systems can fail with certain probabilities upon any shock (Klefsjo,9 Mallor and Omey,10 Gut and Husler,11 Cha and Finkelstein,12 to name a few). In this article, we consider a practically important model in which shocks affect the failure rate directly (Cha and Mi,13 Lemoine and Wenocur14, 15). To the best of our knowledge, there are only a few papers in the literature that consider the number of shocks experienced by a system as a decision parameter for some optimization problems (see, e.g., Finkelstein and Gertsbakh16). Our challenge in this article is to perform the corresponding analysis for systems with a possibility of a mission abort.
The rest of the article is organized as follows. Section 2 presents the problem formulation. Section 3 defines the corresponding failure model. In Section 4, we derive the mission success probability and the system survivability. Section 5 presents an illustrative example and the corresponding analysis. Section 6 concludes the article and outlines possible directions for future research.
2. PROBLEM FORMULATION
Consider a system that performs a mission task that requires continuous operation during the time $\tau$. Thus, for the mission to be completed, the system should be operable in $[0, \tau)$. Let the lifetime of the system in a baseline environment be described by the cdf $F_0(t)$ with the corresponding failure rate $\lambda_0(t)$. However, during a mission, the system can be exposed to shocks of different nature that decrease its lifetime and, consequently, the mission success probability as well. In this article, we assume that shocks occur in accordance with the nonhomogeneous Poisson process (NHPP) $\{N(t), t \ge 0\}$ with rate $v(t)$, where $N(t)$ is the random number of shocks in $[0, t)$ and $T_1 < T_2 < \dots$ are the corresponding (random) shock arrival times. In the model to be described, shocks affect the failure rate of the system directly, increasing it and, therefore, reducing the lifetime.
As mentioned above, there often exist situations in practice when survival of a system, due to safety or cost-related reasons, may have a higher priority than accomplishing the defined mission. This is obviously the case for safety-critical technological processes, experiments, aircraft, manned space missions, and submarines. In these cases, a mission abort policy can be implemented to improve the system's survivability. Thus, when successful mission completion becomes unlikely, the mission should be aborted and a rescue procedure that requires less time than the remaining mission time should be implemented. When damage from shocks is cumulative and shocks are observable, it is reasonable to consider the number of shocks experienced by the system as the corresponding decision variable. Thus, a mission should be aborted upon experiencing m shocks, and the problem is to define this number in an optimal way.
We will first describe the relevant issues regarding mission abort and completion and then address the suggested survival model. It is natural to assume that the time of the rescue procedure is a function of the occurrence time of the mth shock, i.e., $\varphi = \varphi(t_m)$, where $t_m$ is the realization of the random $T_m$ (see an example of the nonmonotonic function $\varphi(t_m)$ in Section 5). When $t_m$ increases, the remaining mission completion time $\tau - t_m$ decreases and, eventually, $\varphi(t_m)$ becomes larger than $\tau - t_m$. It is unreasonable to start the rescue procedure if it takes more time than the remaining mission time. Therefore, we assume that the system continues executing the mission if $\varphi(t_m) \ge \tau - t_m$. Note that we also assume that the same lifetime model holds during the mission time and the rescue stage, which means that the rate of the NHPP of shocks is the same function of time during the primary mission and the rescue procedure. The scenario in which, for instance, this rate is smaller for the rescue phase than for the mission phase can also be of interest, and we plan to consider this case in future research.

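The value of $\xi$, i.e., the time after which the rescue procedure is no longer shorter than the remaining mission time (so that the mission is aborted only if the mth shock occurs before $\xi$), can be obtained for each specific setting (see the example in Section 5).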
The mission succeeds if the system does not fail in $[0, \tau)$ and fewer than m shocks occur in $[0, \xi)$ (no mission abort). Notice that in accordance with this definition, the mission can still succeed if $\xi \le T_m < \tau$, as it is not aborted in this case. In accordance with the above, the mission success probability is $R(\tau,\xi,m) = \Pr(L > \tau, T_m \ge \xi)$.

When the decision parameter m increases, $T_m$ increases in the sense of the usual stochastic order (Shaked and Shanthikumar,17 Finkelstein7) and, therefore, the mission success probability $R(\tau,\xi,m)$ increases (because the abort probability decreases), whereas the system survivability $S(\tau,\xi,m)$ decreases. Specifically, when m = 0 ($T_0 = 0$), the system does not perform the mission task and only executes the rescue procedure, which results in $R(\tau,\xi,0) = 0$ and $S(\tau,\xi,0) = \Pr(L > \varphi(0))$. On the other hand, for m = ∞, the system never performs the rescue procedure and survives only if the mission is successfully completed, which gives $R(\tau,\xi,\infty) = S(\tau,\xi,\infty) = \Pr(L > \tau)$.
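One way to write the survivability that is consistent with the abort rule and with the two boundary cases above is the following decomposition. It is a sketch under the stated assumptions (abort only when the mth shock arrives before ξ, rescue of duration φ(T_m) starting at T_m), and the notation is not necessarily the one used in the article's own derivation.

```latex
% Survive either by completing the mission without abort,
% or by aborting at T_m < \xi and completing the rescue procedure.
S(\tau,\xi,m)
  = \underbrace{\Pr\bigl(L > \tau,\; T_m \ge \xi\bigr)}_{R(\tau,\xi,m)}
  + \Pr\bigl(L > T_m + \varphi(T_m),\; T_m < \xi\bigr)

% Boundary checks:
%   m = 0:      T_0 = 0 < \xi        =>  S = \Pr(L > \varphi(0)),  R = 0
%   m = \infty: T_\infty = \infty    =>  S = R = \Pr(L > \tau)
```

The second term is the probability that the rescue procedure is completed successfully, so S ≥ R for every m, with equality when the mission is never aborted.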


3. FAILURE MODEL







where the expectation is taken with respect to the shock process $\{N(t), t \ge 0\}$ and conditionally on survival in $[0, t)$. Thus, when the baseline failure rate is known, the unobservable parameter η can be estimated from test failure data obtained under a controlled shock rate v.
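To make the failure model concrete, the sketch below assumes the form suggested by the text: each shock adds a constant η to the failure rate, so that the conditional failure rate is λ0(t) + ηN(t). It further assumes constant rates λ0(t) ≡ λ and v(t) ≡ v, as in the example of Section 5. Under these assumptions the survival function has the closed form exp{−λt − vt + (v/η)(1 − e^{−ηt})} and the population failure rate is λ + v(1 − e^{−ηt}). The function names are illustrative, and the Monte Carlo routine is only a cross-check of the closed form.

```python
import math
import random


def survival_closed_form(t, lam, v, eta):
    """Pr(L > t) for failure rate lam + eta*N(t), shocks ~ Poisson(rate v).

    Assumes constant lam and v (an assumption of this sketch).
    """
    return math.exp(-lam * t - v * t + (v / eta) * (1.0 - math.exp(-eta * t)))


def population_failure_rate(t, lam, v, eta):
    """Failure rate of the lifetime L, i.e. lam + eta * E[N(t) | L > t]."""
    return lam + v * (1.0 - math.exp(-eta * t))


def survival_monte_carlo(t, lam, v, eta, n_paths=20_000, seed=1):
    """Average of exp(-cumulative hazard) over simulated shock paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        hazard = lam * t              # baseline contribution
        s = rng.expovariate(v)        # first shock arrival time
        while s < t:
            hazard += eta * (t - s)   # a shock at s raises the rate by eta on [s, t)
            s += rng.expovariate(v)
        total += math.exp(-hazard)
    return total / n_paths
```

For instance, `survival_closed_form(5.88, 0.001, 0.2, 0.03)` gives Pr(L > τ) for the parameter values used with Fig. 3, and `survival_monte_carlo` with the same arguments should agree with it up to sampling error.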
4. MISSION SUCCESS PROBABILITY AND SYSTEM SURVIVABILITY
Recall that L denotes the lifetime of a system that is described by the lifetime model of Equation 5. The proof of the following supplementary result can be found in Cha and Finkelstein.18
Proposition 1. The joint distribution of is given by:


4.1. Probability of the Successful Completion of the Rescue Procedure











4.2. Mission Success Probability














5. ILLUSTRATIVE EXAMPLE
Consider an example of an unmanned aerial vehicle (UAV) that should fly from location a to location d performing a surveillance mission (Fig. 1). The distance between the locations that should be covered by the UAV to fulfill the mission is 1,250 km. The UAV speed is 212.5 km/h. Thus the mission time is $\tau = 1250/212.5 \approx 5.88$ h. There are two safe landing fields, b and c, that can be used for emergency landing along the route. The locations of these fields are shown in Fig. 1. If the flight mission is aborted when the distance covered from the airport a is x, the UAV has to cover the distances $d_a(x)$, $d_b(x)$, $d_c(x)$, and $d_d(x)$ to reach locations a, b, c, and d, respectively. The distance to the closest location for the emergency landing is defined by $d_{\min}(x) = \min\{d_a(x), d_b(x), d_c(x), d_d(x)\}$. Fig. 2 shows these distances as functions of x, the distance covered by the UAV when the decision about the mission abort is made. It can be seen that for x > 1062.5 km, $d_{\min}(x) = d_d(x)$ and the destination location becomes the closest one. Thus, if the mth shock occurs when x > 1062.5 km, the rescue procedure amounts to completing the mission. The distance x = 1062.5 km corresponds to the flight time ξ = 1062.5/212.5 = 5 h. Having $d_{\min}(x)$ and the UAV speed, one can obtain the function $\varphi(t)$ defined in Equation 1.
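The following sketch illustrates how the threshold ξ = 5 h can arise from the geometry. The article gives the field locations only in Fig. 1, so the positions of b and c used here are hypothetical: the route is treated as a straight line with b at 400 km and c at 875 km from a. The value 875 km is chosen only because it makes c and d equidistant at x = 1062.5 km, and 400 km is an arbitrary placeholder. The route length and speed are taken from the text, and all names are illustrative.

```python
SPEED_KMH = 212.5    # UAV speed (from the text)
ROUTE_KM = 1250.0    # distance from a to d (from the text)

# HYPOTHETICAL along-route positions (km) of the fields usable for a rescue.
# Only a = 0 follows from the text; b and c are assumptions of this sketch.
FIELDS_KM = {"a": 0.0, "b": 400.0, "c": 875.0}


def nearest_field_km(x):
    """Distance (km) from route position x to the closest field among a, b, c."""
    return min(abs(x - pos) for pos in FIELDS_KM.values())


def rescue_time_h(t):
    """Rescue duration (h) if the mission is aborted at flight time t (h)."""
    return nearest_field_km(SPEED_KMH * t) / SPEED_KMH


def abort_threshold_h(tau_h, steps=100_000):
    """Earliest grid time from which rescue_time_h(t) >= tau_h - t up to tau_h.

    Scans backward from the end of the mission and stops at the first time
    where a rescue would be strictly shorter than finishing the mission.
    """
    xi = tau_h
    for k in range(steps, -1, -1):
        t = tau_h * k / steps
        if rescue_time_h(t) < tau_h - t:
            break
        xi = t
    return xi
```

With `tau = ROUTE_KM / SPEED_KMH`, calling `abort_threshold_h(tau)` returns a value within one grid step of 5 h under the assumed layout. A different layout (for instance, fields off the route) would need a different `nearest_field_km`.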




The baseline failure rate that corresponds to the UAV's lifetime is assumed to be a constant (λ). The UAV is exposed to the external shocks caused by lightning with the constant rate v. These shocks affect the failure rate in accordance with Equation 5.
To analyze the mission success probability R and the corresponding survivability S, Fig. 3 presents them as functions of the decision parameter m and the shock impact factor η for λ = 0.001 and v = 0.2 (assuming v(t)≡v and λ0(t)≡λ). For convenience of notation in this example, we omit the corresponding arguments where appropriate.

Figs. 4 and 5 present the solutions of the optimization problem with respect to m: max R(m) s.t. S(m)>0.9 as functions of parameters λ, v, and η. It can be seen that for small failure and shock rates, the UAV survivability remains above the level of 0.9 even without aborting the mission, which corresponds to the optimal value m = ∞, for which R = S.
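The constrained problem max R(m) s.t. S(m) > 0.9 can be explored numerically. The sketch below does not use the analytical expressions of Section 4. Instead, it swaps in a simple Monte Carlo estimate over simulated shock paths, assuming constant rates λ and v, a failure rate of the form λ + ηN(t), and a hypothetical straight-line layout with landing fields at 0, 400, and 875 km (these positions are assumptions of the sketch, not data from the article). All function names are illustrative.

```python
import numpy as np

SPEED_KMH = 212.5
# HYPOTHETICAL along-route positions (km) of fields a, b, c (assumption).
FIELDS_KM = np.array([0.0, 400.0, 875.0])


def rescue_time_h(t):
    """Flight time (h) from the abort point to the nearest assumed field."""
    x = SPEED_KMH * np.asarray(t, dtype=float)
    return np.min(np.abs(x[:, None] - FIELDS_KM), axis=1) / SPEED_KMH


def simulate_R_S(m, lam, v, eta, tau, xi, n_paths=20_000, max_shocks=64, seed=0):
    """Monte Carlo estimates of (R, S) when aborting at the m-th shock.

    Shocks beyond max_shocks per path are ignored, which is negligible
    as long as v * tau is much smaller than max_shocks.
    """
    rng = np.random.default_rng(seed)
    # Arrival times of a homogeneous Poisson process with rate v.
    T = np.cumsum(rng.exponential(1.0 / v, size=(n_paths, max_shocks)), axis=1)
    if m == 0:
        t_m = np.zeros(n_paths)            # T_0 = 0: rescue starts at once
    elif m > max_shocks:
        t_m = np.full(n_paths, np.inf)     # treated as "never abort"
    else:
        t_m = T[:, int(m) - 1]
    abort = t_m < xi                       # abort only if the m-th shock precedes xi
    t_abort = np.where(abort, t_m, 0.0)
    # Exposure ends at rescue completion if aborted, else at mission end.
    t_end = np.where(abort, t_abort + rescue_time_h(t_abort), tau)
    # Cumulative hazard of the rate lam + eta*N(t) over [0, t_end).
    H = lam * t_end + eta * np.clip(t_end[:, None] - T, 0.0, None).sum(axis=1)
    surv = np.exp(-H)                      # survival probability given the path
    R = float(np.mean(np.where(abort, 0.0, surv)))
    S = float(np.mean(surv))
    return R, S


def best_m(candidates, s_min, **params):
    """Candidate m with the largest estimated R among those with S > s_min."""
    feasible = []
    for m in candidates:
        R, S = simulate_R_S(m, **params)
        if S > s_min:
            feasible.append((R, m))
    return max(feasible)[1] if feasible else None
```

A call such as `best_m([1, 2, 3, float("inf")], 0.9, lam=0.001, v=0.1, eta=0.03, tau=1250 / 212.5, xi=5.0)` returns the abort threshold with the highest estimated success probability that still satisfies the survivability constraint, or `None` when no candidate does. Using the same seed for every m keeps the comparison between candidates free of extra sampling noise.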


As the shock impact factor η and the rates λ and v increase, the mission abort becomes necessary for providing the desired UAV survivability, and the optimal number of shocks after which the mission is aborted decreases. On the other hand, starting from certain levels of η, λ, and v, the rescue procedure, even if activated after the first shock, cannot provide UAV survivability above 0.9 (see Fig. 5).
Fig. 6 presents the solutions of the cost minimization problem with respect to m: min C(m) = CF(1 − R(m))+CL(1 − S(m)) for CF = 1, η = 0.03, and v = 0.1 as functions of parameters CL and λ. It can be seen that with the increase of the cost of the UAV loss CL, the optimal value of m decreases, which makes the mission abort more likely. The decrease of m causes the increase in survivability S (at the cost of decreasing R).
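The dependence of the optimal m on C_L can be seen from a one-step comparison. The derivation below is a sketch that assumes the monotonicity stated in Section 2, namely that R(m) is nondecreasing and S(m) is nonincreasing in m.

```latex
% Cost change when the abort threshold is raised from m to m+1:
C(m+1) - C(m)
  = -\,C_F\,\bigl[R(m+1) - R(m)\bigr] \;+\; C_L\,\bigl[S(m) - S(m+1)\bigr]

% With \Delta R(m) = R(m+1) - R(m) \ge 0 and \Delta S(m) = S(m) - S(m+1) \ge 0,
% raising the threshold lowers the cost if and only if
C_F\,\Delta R(m) \;>\; C_L\,\Delta S(m)
\quad\Longleftrightarrow\quad
\frac{\Delta R(m)}{\Delta S(m)} \;>\; \frac{C_L}{C_F}
\qquad (\Delta S(m) > 0)
```

As C_L grows with C_F fixed, the right-hand side increases, so fewer increments of m pay off and the minimizer of C(m) cannot move to larger values, which is consistent with the behavior reported for Fig. 6.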

6. CONCLUSIONS
This article presents a model for obtaining relevant operational characteristics of a failure prone system that is performing a mission. A system is operating in a random environment modeled by an external shock process affecting its time-to-failure distribution. Each shock results in a constant increment in the failure rate, which is described by the corresponding failure model.
If the mission completion becomes problematic, the mission can be aborted and a rescue procedure is activated. Assuming that the decision about the mission abort is made upon occurrence of the mth shock, we present an original method of evaluating the corresponding mission success probability and the system survivability. Based on the obtained results, one can find the value of optimal m that balances the tradeoff between the mission success probability and the system survivability. We illustrate our findings and approach by considering an example of the unmanned aerial vehicle that should fly from one location to another performing a surveillance mission.
We believe that the considered mission abort model can be applied to different fields ranging from space exploration19, 20 to mining21 and other areas. For instance, to save extremely expensive drilling equipment, a drilling mission can be aborted at some stage. We hope that our approach can also be applied in healthcare to make decisions about urgent treatment withdrawal in the case of shocks. For example, when a biotherapy is used for inflammatory bowel disease treatment, infections (shocks) during the treatment (mission) can cause adverse effects. In this case, the decision to urgently abort the treatment and apply anti-infection measures can be made by comparing the risk of losing the therapy effect with the risk of the adverse effects of infections.22 Similar considerations can be applied in more general settings when planning treatments of patients with chronic diseases, as these treatments should be aborted (depending on the comparison of the corresponding risks) in the presence of, e.g., some opportunistic diseases. However, this topic needs further investigation with the help of health-care professionals.
Further research in this direction can employ other models of shock impact on a system's time-to-failure (e.g., extreme shock models that depend on the history of previous shocks). Scenarios in which the environments for the mission execution and the rescue procedure are modeled by shock processes with different rates should also be studied. A combined shock model can also be considered, in which each shock results in a failure with a given probability, whereas with the complementary probability it increases the failure rate by a constant or random amount.
ACKNOWLEDGMENTS
This work was partly supported by the National Natural Science Foundation of China (No. 61170042) and the Jiangsu Province Development and Reform Commission (No. 2013–883).