Cicero said: “The shifts of fortune test the reliability of friends.” Fortunately, the reliability of processes is more easily understood than that of friends. It is defined as the probability of performing a specific function under certain conditions for a specific period of time without failure. It is composed of

- a probability of performing without failure (probability)
- definition of the failure
- definition of the function it performs
- the conditions under which it must perform
- the time duration

Failure types are not all equal. Censoring describes what is known about when each failure occurred. Right-censored data occur when a unit did not fail during the study, so the failure time is known only to be after the study ended. Left-censored data indicate that the unit failed before its first inspection, so the failure time is known only to be before that point. Interval-censored data occur when inspections are made at regular times, so the failure time is known only to lie between two inspections.
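
The three censoring types above can be made concrete by encoding each observation as an interval that brackets the unknown failure time. This is a minimal sketch; the helper names and data are illustrative, not from any particular package.

```python
import math

def exact(t):            # failure observed at exactly time t
    return (t, t)

def right_censored(t):   # still running at time t; failure is some time after t
    return (t, math.inf)

def left_censored(t):    # already failed at first inspection at time t
    return (0.0, t)

def interval_censored(t_prev, t_next):  # failed between two inspections
    return (t_prev, t_next)

# One record per unit: the true failure time lies inside each interval.
observations = [
    exact(120.0),
    right_censored(500.0),        # survived the whole study
    left_censored(40.0),          # found dead at the first look
    interval_censored(200.0, 250.0),
]
```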

The types of censoring determine the estimation method. A mix of failures and right-censored data is fit by the Kaplan-Meier method; for left-censored, interval-censored, or mixed-censored data, Turnbull estimates are used. Common parametric distributions are

- lognormal — arises as the product of many independent, positive random variables
- Weibull — used with changing hazard rates
- Fréchet — similar to the Weibull and used with extreme values
- loglogistic — used with non-monotonic hazard function
- exponential — used with a constant hazard rate, where the chance of failure does not depend on a unit's age (memoryless)
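
The Kaplan-Meier product-limit estimate mentioned above can be sketched in a few lines for a mix of failures and right-censored units. The data and tie handling here are simplified for illustration.

```python
def kaplan_meier(times, failed):
    """times: event or censoring times; failed: True for a failure, False if censored.
    Returns (time, survival probability) pairs at each observed failure time."""
    # Sort by time; ties are handled observation-by-observation here,
    # which gives the same product as grouping deaths at a tied time.
    data = sorted(zip(times, failed))
    n_at_risk = len(data)
    s = 1.0
    curve = []
    for t, is_failure in data:
        if is_failure:
            s *= (n_at_risk - 1) / n_at_risk   # survive past this failure time
            curve.append((t, s))
        n_at_risk -= 1                          # unit leaves the risk set either way
    return curve

times  = [5, 8, 12, 12, 15, 20]
failed = [True, False, True, True, False, True]
curve = kaplan_meier(times, failed)
```

Censored units (times 8 and 15 here) shrink the risk set without stepping the survival curve down, which is exactly how right censoring is absorbed by the estimator.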

Several of these distributions (Weibull, lognormal, Fréchet, loglogistic) can have a threshold parameter added to shift the distribution away from zero, accounting for units that cannot fail before a threshold value. Where only a fraction of the population carries the defect leading to failure, defective-subpopulation versions of these distributions can be used; where a given proportion of units fail at time zero, zero-inflated versions apply. Different distributions are compared with the loglikelihood, AICc (corrected Akaike’s information criterion), or BIC (Bayesian information criterion). A comparison of the Weibull and exponential fit using a life distribution can be seen in Figure 1.
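
A Weibull-versus-exponential comparison like the one in Figure 1 can be sketched with SciPy on simulated, complete (uncensored) lifetimes. This is an assumption-laden sketch: the data are synthetic, the location parameters are fixed at zero, and the article's own analysis was done in JMP.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.weibull(1.5, size=200) * 1000.0   # synthetic lifetimes, shape 1.5

def information_criteria(dist, params, data, k):
    """k = number of free parameters (a loc fixed at zero is not counted)."""
    n = len(data)
    loglik = float(np.sum(dist.logpdf(data, *params)))
    aic = 2 * k - 2 * loglik
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample correction
    bic = k * np.log(n) - 2 * loglik
    return loglik, aicc, bic

weib_params = stats.weibull_min.fit(data, floc=0)   # (shape, loc, scale)
expo_params = stats.expon.fit(data, floc=0)         # (loc, scale)

ll_w, aicc_w, bic_w = information_criteria(stats.weibull_min, weib_params, data, k=2)
ll_e, aicc_e, bic_e = information_criteria(stats.expon, expo_params, data, k=1)
# Since the data were generated with shape 1.5, the Weibull fit should
# score lower (better) on AICc and BIC than the exponential.
```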

Confidence intervals for distribution parameters can be computed with either the Wald or the likelihood method. When there are multiple causes of failure, comparing them is important, and, often, different distribution types need to be used for different failure types. In situations where there is only one factor affecting lifetime to events, the relationship between the event and the factor can be analyzed using transformations such as the Arrhenius, linear, log, logit, reciprocal, square root, or Box-Cox. Often, a unit breaks down, is repaired, and is put back into service; in that case a cumulative count of events, possibly cost-based, is analyzed. Multiple units can be compared for deterioration over time using the transformations mentioned previously, and estimates of time to failure can be given with confidence intervals. A comparison of three units using a degradation model, with predictions at time 250 and confidence intervals, is seen in Figure 2.
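
The degradation comparison in Figure 2 can be caricatured with a simple per-unit linear fit extrapolated to time 250. The data, the three unit names, and the linear degradation path are all assumptions for illustration; the article's model (and its confidence intervals) may differ.

```python
import numpy as np

times = np.array([0, 50, 100, 150, 200], dtype=float)
units = {                                  # measured performance per unit (hypothetical)
    "A": np.array([100.0, 97.0, 94.2, 91.1, 88.0]),
    "B": np.array([100.0, 96.0, 92.1, 88.2, 84.0]),   # degrades fastest
    "C": np.array([100.0, 98.1, 96.0, 94.1, 92.0]),   # degrades slowest
}

predictions = {}
for name, y in units.items():
    slope, intercept = np.polyfit(times, y, 1)        # least-squares line
    predictions[name] = intercept + slope * 250.0     # extrapolate to time 250
# Unit B, with the steepest decline, gets the lowest predicted value at 250.
```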

Once a process is in production, future failures can be forecast using production dates, production volumes, and failure dates. Production dates and counts of failures within specified periods are stored in the Nevada format, so named because a graph of the data resembles the state. Returned units are considered interval censored, where the interval runs from the last recorded time to the time the failure was seen; left censoring also needs to be accounted for. Details of the distribution type, forecasting intervals, contract lengths, and approximation methods, including the size of any Monte Carlo simulations, are needed. A reliability forecast can be seen in Figure 3.
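
The Nevada-format bookkeeping can be sketched as a conversion from the production/returns triangle into censored records. All counts and period labels below are illustrative, and the sketch handles only the interval- and right-censored cases described above.

```python
import math

# Rows: production cohorts; columns: the period each return was seen.
production = {"2023-01": 1000, "2023-02": 1200, "2023-03": 900}
returns = {
    "2023-01": {"2023-02": 5, "2023-03": 8, "2023-04": 6},
    "2023-02": {"2023-03": 7, "2023-04": 9},
    "2023-03": {"2023-04": 4},
}
periods = ["2023-01", "2023-02", "2023-03", "2023-04"]

records = []  # (lower age in periods, upper age in periods, count, censor type)
for ship, qty in production.items():
    i = returned = 0
    i = periods.index(ship)
    returned = 0
    for seen, n in returns[ship].items():
        j = periods.index(seen)
        # Failure happened somewhere between ages j-1-i and j-i.
        records.append((j - 1 - i, j - i, n, "interval"))
        returned += n
    # Units never returned are still alive past the last observed age.
    records.append((len(periods) - 1 - i, math.inf, qty - returned, "right"))
```

Every produced unit ends up in exactly one record, either as an interval-censored return or as a right-censored survivor.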

As changes are implemented, system reliability can be modeled to increase the mean time between failures (MTBF) by capturing fixes in different phases of production. A Duane plot shows Cumulative MTBF estimates plotted against the Time to Event variable. If the data follow the Duane model, the points should follow a line when both axes are log transformed. When there are at least two phases, a piecewise Weibull NHPP (non-homogeneous Poisson process) model can be fit. These models are used for repairable systems. Another type of NHPP model used within a phase is the Crow AMSAA (U.S. Army Materiel Systems Analysis Activity), which has failure intensity as a function of time. Fixed parameter and bias corrected versions of the growth parameter for Crow AMSAA are also available. A reliability growth model is shown in Figure 4.
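The Crow-AMSAA model mentioned above has closed-form maximum-likelihood estimates for a time-terminated test. The failure times and total test time below are illustrative.

```python
import math

failure_times = [4.3, 10.1, 17.8, 30.0, 48.5, 75.2, 110.0, 160.0]
T = 200.0                       # total observed test time (time-terminated)

n = len(failure_times)
# Standard power-law NHPP MLEs: intensity u(t) = lam * beta * t**(beta - 1)
beta_hat = n / sum(math.log(T / t) for t in failure_times)  # growth parameter
lam_hat = n / T ** beta_hat                                  # scale parameter

# beta < 1 means the failure intensity is decreasing, i.e. reliability growth.
mtbf_T = 1.0 / (lam_hat * beta_hat * T ** (beta_hat - 1.0))  # instantaneous MTBF at T
```

With the widening gaps between these illustrative failure times, the estimated growth parameter comes out below one, consistent with improving reliability.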

When the reliability distribution of system components and their relationships are known, reliability blocks with dependencies can be constructed to model reliability behavior.
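
For the simplest block structures, series and parallel, the system reliability follows directly from the component reliabilities, assuming the components fail independently. The two-pump example is hypothetical.

```python
import math

def series(reliabilities):
    # A series block works only if every component works.
    return math.prod(reliabilities)

def parallel(reliabilities):
    # A parallel block fails only if every component fails.
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Example: two redundant pumps (each 0.90) feeding one controller (0.99).
system_r = series([parallel([0.90, 0.90]), 0.99])
```

Redundancy lifts the pump block from 0.90 to 0.99, so the controller becomes the limiting component, which is the kind of insight a block-diagram model makes visible.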

It should be mentioned that survival analysis, which is used for drug and medical-treatment studies, uses models similar to those of reliability analysis.

Fitting different distributions, with judicious use of transformations and fitted equations, can be used to understand and improve processes during the research phase. Once a product is released, reliability models based on MTBF can be used to analyze and improve it on the market.

__Note__: All graphs were generated using JMP v.11.2.0 software.

*Mark Anawis is a Principal Scientist and ASQ Six Sigma Black Belt at Abbott. He may be reached at mark.anawis@abbott.com*