When is a Million-Year MTBF Too Short?

by Jerry Cox on 07-21-2014 at 8:00 am

The reliability metric Mean Time Between Failures (MTBF) is often misunderstood. Use of an MTBF metric generally assumes a random failure process, one in which failures are very infrequent and which has no memory of past failures. Such failure modes can occur in System-on-Chip (SoC) designs and include radiation effects, synchronizer malfunctions at clock domain crossings, and other rare failures triggered by highly unusual combinations of events. In such a random failure process, 63% (that is, 1 − 1/e) of the systems described by a particular value of MTBF will have failed before the end of the MTBF period. Viewing the system failures as occurring at “one MTBF” is therefore quite misleading.
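
The 63% figure follows from the memoryless assumption. As a minimal sketch, assuming failure times are exponentially distributed with mean equal to the MTBF (the function name `fraction_failed` is illustrative, not from the article), the fraction of units expected to have failed by time t is 1 − exp(−t/MTBF):

```python
import math


def fraction_failed(t_years: float, mtbf_years: float) -> float:
    """Expected fraction of units failed by time t, assuming an
    exponential (memoryless) failure model with the given MTBF."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-t_years / mtbf_years)


mtbf = 1_000_000.0  # a million-year MTBF
print(f"Failed by one MTBF: {fraction_failed(mtbf, mtbf):.1%}")  # 63.2%
print(f"Failed within 10 years: {fraction_failed(10.0, mtbf):.6%}")
```

The second line shows why a very large MTBF still matters at volume: the per-unit probability over a 10-year life is tiny, but it is multiplied by millions of units.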

To investigate the impact of using MTBF to describe such SoC failures in safety-critical products, let us create a product story that brings the business issues into focus. Consider a mass-produced product (for example, millions of cars per year) in which every unit carries the same specific failure risk, a risk that affects multiple models and persists through multiple model years. We wish to examine the aggregate liability from fatalities arising from such a defect, and we use the Toyota Sudden Unintended Acceleration (SUA) experience to assess that liability (40,000 SUA reports and over 200 claims, with two initial claims settled for $1.5M each).

As Toyota and a surprising number of other automobile manufacturers have found, these losses can grow dramatically as long as the defect is not eliminated in succeeding model years. To simplify the analysis, assume that wrongful-death suits are settled for S = $1.5M and that cars with the defect are produced with a total volume of V = 5M cars/year over all affected models. The probability that the defect leads to a fatal accident within a year is a very small number p. In the first year of production of cars with the defect, the estimated liability loss will be ½SVp since the average sale date will be halfway through the first year. In the second year of production, essentially all the cars from the first year are still on the road, with estimated liability loss SVp, but now a second cohort of cars has been added, with estimated liability loss of ½SVp. By the end of the second year the total estimated losses sum to 2SVp.

To extend the estimate to later years consider the following table showing the annual estimated liability losses (rows) for vehicles sold in a given year (columns).

As noted above, at the end of the second year the total estimated loss is 2∙SVp (shown within the smaller oval) and at the end of the third year it is 4.5∙SVp (shown in the larger oval). The grand total liability after N years can be found by summing all N² cells in the table. Because the average loss per cell is ½SVp, that total is ½N²∙SVp.

Thus, the losses grow quadratically with the number of years that the defect fails to be remedied. This is a sad fact that the automotive sector is learning the hard way.
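
The table argument can be checked numerically. The sketch below rebuilds the table from the rules stated in the text, in units of SVp: a cohort contributes ½ in the year it is sold, 1 in every later year, and 0 before it exists. The function names are illustrative assumptions, not part of the original analysis.

```python
def liability_table(n_years: int) -> list[list[float]]:
    """Rows are calendar years, columns are sale-year cohorts.
    Each cell is that cohort's expected loss in that year, in units of S*V*p."""
    table = []
    for year in range(n_years):
        row = []
        for cohort in range(n_years):
            if cohort < year:
                row.append(1.0)   # sold in an earlier year: on the road all year
            elif cohort == year:
                row.append(0.5)   # sold this year: on the road half a year on average
            else:
                row.append(0.0)   # not yet sold
        table.append(row)
    return table


def cumulative_units(n_years: int) -> float:
    """Total expected loss after n_years, in units of S*V*p."""
    return sum(sum(row) for row in liability_table(n_years))


for n in (1, 2, 3, 10):
    print(n, cumulative_units(n), n * n / 2)
```

The two printed totals agree for every N, which is the quadratic growth described above: N diagonal cells of ½ plus N(N − 1)/2 cells of 1 sum to N²/2.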

We have assumed that the defect is one that can lead to a fatal event at any time. As a result, with MTBF expressed in years, p = 1/MTBF is the probability that a given car suffers a fatal event within any one year. Because p is a very small number and because we assume no recall, the cars removed from the road because of a fatal event during a year have a negligible effect on the total number of cars on the road in later years.

This product story ignores several important issues:

  • Damage to the manufacturer’s reputation and other settlements may have a greater impact on the company than the liability losses. In the Toyota SUA example, settling 200 cases may cost $300M, but the liability associated with legal settlements based on the decreased resale value of Toyota vehicles is expected to exceed $2B.
  • Not all SoC failures result in serious accidents. From the Toyota SUA data, only one in 200 complaints resulted in a suit against the company.
  • An SoC may have many independent instances of the same kind of defect, increasing the probability of failure p proportionately.
  • Failures and liability are likely to increase with time even more than predicted above as a result of transistor aging.
  • A 1-month post-fabrication test has less than a 1% chance of detecting an SoC that will fail in its 10-year lifetime.
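
The last caveat can be illustrated with a short calculation. This is a sketch under two stated assumptions that the article does not spell out: failures follow the same memoryless (exponential) model, and the test catches a defective part only if the failure actually occurs during the test window. The function name `detection_chance` is hypothetical.

```python
import math


def detection_chance(test_years: float, lifetime_years: float,
                     mtbf_years: float) -> float:
    """P(failure occurs during the test | failure occurs within the lifetime),
    assuming exponentially distributed failure times."""
    p_fail_in_test = -math.expm1(-test_years / mtbf_years)
    p_fail_in_life = -math.expm1(-lifetime_years / mtbf_years)
    return p_fail_in_test / p_fail_in_life


# One-month test, 10-year lifetime, million-year MTBF.
chance = detection_chance(1.0 / 12.0, 10.0, 1_000_000.0)
print(f"{chance:.2%}")  # about 1 in 120
```

When the MTBF is far longer than the product lifetime, the result is essentially the ratio of test duration to lifetime, 1/120, which is below 1%.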

Provided one is mindful of these caveats to our product story, the average cumulative losses, in millions of dollars, for each year from the introduction of the defect until its mitigation are shown in the figure below:

Thus, after 10 years the predicted losses are $375M resulting from a fatal SoC defect with a million-year MTBF (p = 0.000001). Even with a billion-year MTBF (p = 0.000000001) the average 10-year losses are $0.4M. In summary, unless it is certain that a product has an MTBF of greater than a billion years, there is both a business and an ethical case to ensure that such failures are effectively minimized, carefully monitored, and their undesirable effects reliably mitigated. Because of the rapid growth in liability, manufacturers should develop a replacement part and issue a recall with the minimum of delay.
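
These figures can be reproduced from the quantities defined earlier. The sketch below assumes the cumulative loss after N years is ½N²·SVp, with S = $1.5M, V = 5M cars/year and p = 1/MTBF as given in the text; the function name and default arguments are illustrative.

```python
def cumulative_loss_usd(n_years: int, mtbf_years: float,
                        settlement_usd: float = 1.5e6,
                        cars_per_year: float = 5e6) -> float:
    """Expected cumulative liability after n_years: 0.5 * N^2 * S * V * p."""
    p = 1.0 / mtbf_years  # probability of a fatal event per car per year
    return 0.5 * n_years ** 2 * settlement_usd * cars_per_year * p


for mtbf in (1e6, 1e9):
    loss_millions = cumulative_loss_usd(10, mtbf) / 1e6
    print(f"MTBF {mtbf:.0e} years: ${loss_millions:.3f}M after 10 years")
```

For the million-year case this gives $375M, and for the billion-year case $0.375M, which the text rounds to $0.4M.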
