Siemens EDA Discuss Permanent and Transient Faults

by Bernard Murphy on 10-05-2022 at 6:00 am

This is a topic worth covering for those of us who aim to know more about safety. There are devils in the details of how ISO 26262 quantifies fault metrics, and I consider my understanding probably similar to that of other non-experts: light. All in all, this Siemens EDA white paper is a nice summary of the topic.

Permanent and transient faults 101

The authors kick off with a section on “what are they and where do they come from”. They describe the behavior well enough, along with a mechanism to model permanent faults (stuck-at) and general root causes (EMI, radiation, vibration, etc).

The rest of the opening section is valuable, talking about base failure rates and the three metrics most important to ISO 26262. These are the single point fault metric (SPFM), the latent fault metric (LFM) and the probabilistic metric for random hardware failures (PMHF). All three are built on FIT rates (failures in time, counted as failures per billion device-hours). Permanent faults affect all three and can be estimated or measured in accelerated life testing.
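To make the metrics a little more concrete, here is a minimal sketch of the usual ISO 26262 ratio definitions of SPFM and LFM. This is my own illustration, not something taken from the white paper. The function names and all FIT values are hypothetical, and a real analysis sums failure rates per safety-related element and failure mode.

```python
def spfm(total_fit: float, single_point_fit: float, residual_fit: float) -> float:
    """Single point fault metric.

    Fraction of the safety-related failure rate that is NOT a single-point
    or residual fault, i.e. 1 - (lambda_SPF + lambda_RF) / lambda_total.
    """
    return 1.0 - (single_point_fit + residual_fit) / total_fit


def lfm(total_fit: float, single_point_fit: float, residual_fit: float,
        latent_fit: float) -> float:
    """Latent fault metric.

    Of the failure rate left after removing single-point and residual
    faults, the fraction that is NOT a latent multiple-point fault.
    """
    remaining = total_fit - single_point_fit - residual_fit
    return 1.0 - latent_fit / remaining


# Hypothetical numbers, in FIT (failures per 1e9 device-hours).
TOTAL_FIT = 100.0        # all safety-related hardware faults
SINGLE_POINT_FIT = 0.5   # faults with no safety mechanism at all
RESIDUAL_FIT = 0.5       # faults that escape an existing safety mechanism
LATENT_FIT = 9.9         # multiple-point faults neither detected nor perceived

print(f"SPFM = {spfm(TOTAL_FIT, SINGLE_POINT_FIT, RESIDUAL_FIT):.2%}")
print(f"LFM  = {lfm(TOTAL_FIT, SINGLE_POINT_FIT, RESIDUAL_FIT, LATENT_FIT):.2%}")
```

For reference, the commonly cited ASIL D targets are SPFM of at least 99%, LFM of at least 90% and PMHF below 10 FIT.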

How do these relate to FMEDA analysis? FMEDA estimates the effectiveness of safety mitigations against transient faults, providing a transient fault component to the SPFM and PMHF metrics. It has nothing to do with permanent faults or LFM metrics. Got that?

Safety mechanisms

There’s a nice discussion on safety mechanisms and their effectiveness in detecting different types of faults. One example they show uses software test libraries (STL), a new concept to me. They note STLs are unlikely to be helpful in detecting transient faults, given that the fault may vanish during the execution of the test. However, there are multiple mechanisms to help here. Triple modular redundancy, lockstep compute and ECC are examples.
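As a toy illustration of why redundancy handles transients where a periodic test may not, here is a small software model of a triple modular redundancy voter and a lockstep comparison. This is a sketch of the general techniques, not the paper's implementation. The helper names and the register value are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value: int, bit: int) -> int:
    """Model a transient fault (single-event upset) as one flipped bit."""
    return value ^ (1 << bit)


def lockstep_mismatch(primary: int, shadow: int) -> bool:
    """Lockstep only detects: it flags a difference but cannot say who is right."""
    return primary != shadow


golden = 0b1011_0010                  # hypothetical register contents
upset = flip_bit(golden, 3)           # one copy takes a bit flip

# TMR masks the fault: the two good copies outvote the bad one.
corrected = majority_vote(golden, upset, golden)
print(f"voted value matches golden: {corrected == golden}")

# Lockstep raises an error flag in the same cycle the copies diverge.
print(f"lockstep mismatch flagged: {lockstep_mismatch(golden, upset)}")
```

The voter corrects in place while the lockstep pair only detects, which is part of why the two carry different area costs and different diagnostic coverage arguments.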

There is an introduction to periodic hardware self-test, which is becoming more important in ASIL-D compliance for in-flight block validation. They suggest that configuration registers could be scrubbed during such testing, eliminating transient-induced configuration errors. An interesting idea, but I suspect this would need some care to avoid serious overkill in requiring a function to be reconfigured from scratch on each retest. It might be interesting if all the configuration registers had protected restore registers, allowing recovery from a known good and recent state.
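To show what I have in mind with protected restore registers, here is a rough software model, purely my own speculation and not something described in the paper. The class name, register names and the choice of a CRC to protect the shadow copy are all assumptions. Real hardware would more likely use parity or ECC on the restore registers.

```python
import zlib


class ConfigBank:
    """Live configuration registers plus a CRC-protected shadow copy."""

    def __init__(self, values: dict[str, int]):
        self.live = dict(values)
        self.shadow = dict(values)
        self.shadow_crc = self._crc(self.shadow)

    @staticmethod
    def _crc(regs: dict[str, int]) -> int:
        # Serialize in a fixed order; values are assumed to fit in 32 bits.
        data = b"".join(
            name.encode() + value.to_bytes(4, "little")
            for name, value in sorted(regs.items())
        )
        return zlib.crc32(data)

    def write(self, name: str, value: int) -> None:
        """Intentional update: keep live and shadow copies in step."""
        self.live[name] = value
        self.shadow[name] = value
        self.shadow_crc = self._crc(self.shadow)

    def scrub(self) -> list[str]:
        """Restore any live register that differs from the shadow copy.

        Returns the names of the registers that were repaired.
        """
        if self._crc(self.shadow) != self.shadow_crc:
            raise RuntimeError("shadow copy corrupted; full reconfiguration needed")
        repaired = [n for n, v in self.live.items() if v != self.shadow[n]]
        for name in repaired:
            self.live[name] = self.shadow[name]
        return repaired


bank = ConfigBank({"clk_div": 4, "mode": 1})   # hypothetical registers
bank.live["mode"] ^= 1 << 2                    # transient upset in a live register
print("repaired:", bank.scrub())
```

The point of the sketch is that a scrub only rewrites what has drifted, so the function does not need to be reconfigured from scratch on every self-test pass.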

More on transient faults

The paper has a good discussion on transient faults in relation to FIT rates. They point out that storage elements are most important, noting that failure rates on combinational elements rarely rise to statistical significance. Transients are about bit flips rather than signal glitches; the effect must persist for some time, if only a clock cycle. Glitches can also cause bad behavior, but the statistical significance of such problems is apparently low.

They extend this argument to the need to pay more attention to registers which are infrequently updated (e.g. configuration registers) versus registers which are frequently updated, on the grounds that a fault in a long-lived value may have more damaging consequences. I understand the reasoning with respect to FIT rate: a long-lived error may cause more failures. But the argument seems a bit loose. An error in a frequently updated register can propagate to memory, where it may also live for a long time.

I didn’t learn about fault detection time intervals (FDTI) until relatively recently. The paper has a good discussion on this, and on fault tolerant time intervals (FTTI). How long do you have after a fault occurs to detect it and do something about it? Useful information for those planning safety mitigations.
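The timing budget can be sketched in a few lines. In ISO 26262 vocabulary, the detection interval (FDTI) plus the fault reaction time interval (FRTI, a term from the standard rather than from this post) gives the fault handling time, which has to fit inside the FTTI. The function names, the millisecond values and the worst-case assumption for a periodic self-test below are all mine, for illustration only.

```python
def fault_handling_time_ms(fdti_ms: float, frti_ms: float) -> float:
    """Time from fault occurrence until the system has reached a safe state."""
    return fdti_ms + frti_ms


def meets_ftti(fdti_ms: float, frti_ms: float, ftti_ms: float) -> bool:
    """True if detection plus reaction completes within the tolerant interval."""
    return fault_handling_time_ms(fdti_ms, frti_ms) <= ftti_ms


def worst_case_fdti_periodic_ms(test_period_ms: float, test_duration_ms: float) -> float:
    """Simplified worst case for a periodic self-test.

    Assumes a fault landing just after a test run starts is missed by that
    run and only caught when the next run finishes.
    """
    return test_period_ms + test_duration_ms


# Hypothetical budget: self-test every 8 ms, taking 2 ms to run,
# 5 ms to react, and a 20 ms fault tolerant time interval.
fdti = worst_case_fdti_periodic_ms(test_period_ms=8.0, test_duration_ms=2.0)
print(f"worst-case FDTI: {fdti} ms")
print(f"fits within FTTI: {meets_ftti(fdti, frti_ms=5.0, ftti_ms=20.0)}")
```

A check like this makes the trade-off visible: running the self-test less often saves power and bandwidth but eats directly into the margin against the FTTI.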

You can read the white paper HERE.
