Fault Simulation for AI Safety. Innovation in Verification

by Bernard Murphy on 03-27-2024 at 6:00 am

More automotive content 😀

In modern cars, safety is governed as much by AI-based functions as by traditional logic and software. How can these functions be fault-graded for FMEDA analysis? Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.

Fault Simulation for AI Safety Grading

The Innovation

This month’s pick is SiFI-AI: A Fast and Flexible RTL Fault Simulation Framework Tailored for AI Models and Accelerators. This article was published in the 2023 Great Lakes Symposium on VLSI. The authors are from the Karlsruhe Institute of Technology, Germany.

ISO 26262 requires safety analysis based on FMEDA methods, using fault simulation to assess the sensitivity of critical functions to transient and systematic faults and the effectiveness of mitigation logic in guarding against errors. Analysis starts with a design expert’s understanding of which high-level behaviors must be guaranteed and which realistic failures might propagate errors into those behaviors.

This expert know-how is already understood for conventional logic and software but not yet for AI models (neural nets) and the accelerators on which they run. Safety engineers need help exploring failure modes and effects in AI components to know where and how to fault models and hardware. Further, that analysis must run at practical speeds on the large models common for DNNs. The authors propose a new technique which they say runs much faster than current methods.

Paul’s view

A thought-provoking and intriguing paper: how do you assess the risk of random hardware faults in an AI accelerator used for driver assist or autonomous driving? AI inference is itself a statistical method, so determining the relationship between a random bit flip somewhere in the accelerator and an incorrect inference is non-trivial.

This paper proposes building a system that can “swap in” a real RTL simulation of a single layer of a neural network into an otherwise pure software-based inference of that network in PyTorch. A fault can be injected into the layer being RTL simulated to assess the impact of that fault on the overall inference operation.
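The swap-in idea can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the `dense`, `with_fault`, `infer` and `run_with_swapped_layer` helpers are hypothetical names, the "network" is two tiny fully connected layers, and the faulted wrapper merely stands in for what would be a slow, cycle-accurate RTL simulation of one layer.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def dense(weights: Sequence[Sequence[float]]) -> Layer:
    """Toy fully connected layer with ReLU (stand-in for a fast software layer)."""
    def forward(x: Vector) -> Vector:
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return forward


def with_fault(layer: Layer, output_index: int,
               corrupt: Callable[[float], float]) -> Layer:
    """Wrap a layer so one output element is corrupted.

    Hypothetical stand-in for the faulted RTL simulation of that layer.
    """
    def forward(x: Vector) -> Vector:
        y = layer(x)
        y[output_index] = corrupt(y[output_index])
        return y
    return forward


def infer(layers: Sequence[Layer], x: Vector) -> int:
    """Run all layers in order and return the top-1 class index."""
    for layer in layers:
        x = layer(x)
    return max(range(len(x)), key=lambda i: x[i])


def run_with_swapped_layer(layers: Sequence[Layer], index: int,
                           faulty_layer: Layer, x: Vector) -> int:
    """Replace exactly one layer with its faulted version, keep the rest as-is."""
    swapped = list(layers)
    swapped[index] = faulty_layer
    return infer(swapped, x)


identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [dense(identity), dense(identity)]
sample = [2.0, 1.0]

golden = infer(layers, sample)
faulty = run_with_swapped_layer(
    layers, 0, with_fault(layers[0], 1, lambda v: v + 10.0), sample)
print(golden, faulty)
```

Only the swapped layer pays the cost of detailed simulation; every other layer runs at software speed, which is where a speedup of this kind of flow would come from.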

The authors demonstrate their method on the Gemmini open-source AI accelerator running ResNet-18 and GoogLeNet image classification networks. They observe that each element of the Gemmini accelerator array has 3 registers (input activation, weight and partial sum) and a weight select signal, together giving 4 possible types of fault to inject. They run 1.5M inference experiments, each with a random fault injected, checking if the top-1 classification out of the network is incorrect. Their runtime is an impressive 7x faster than prior work, and their charts validate the intuitive expectation that faults in earlier layers of the network are more impactful than those in deeper layers.
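A campaign like this boils down to estimating a probability from many trials. The sketch below is an illustration only: the `top1_error_rate` helper is hypothetical, it assumes an error means the faulted run's top-1 class differs from a fault-free (golden) run on the same input, and it uses a normal-approximation 95% interval, which the paper may or may not use.

```python
import math
from typing import Sequence, Tuple


def top1_error_rate(golden: Sequence[int],
                    faulty: Sequence[int]) -> Tuple[float, float]:
    """Estimate the top-1 error probability and a 95% half-width.

    golden[i] and faulty[i] are the top-1 classes for experiment i without
    and with an injected fault (assumed definition of an error).
    """
    if len(golden) != len(faulty) or not golden:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(golden)
    errors = sum(g != f for g, f in zip(golden, faulty))
    p = errors / n
    # Normal approximation to the binomial; reasonable when n is large.
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / n)
    return p, half_width


# 100 toy experiments, 5 of which changed the classification.
p, hw = top1_error_rate([0] * 100, [0] * 95 + [1] * 5)
print(f"error rate {p:.3f} +/- {hw:.3f}")
```

The half-width shrinks with the square root of the number of experiments, which is one reason campaigns of this kind need very large experiment counts to resolve small error probabilities.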

Also, it’s clear from their data that some form of hardware safety mechanism (e.g. triple-voting) is warranted since the absolute probability of a top-1 classification error is 2-8% for faults in the first 10 layers of the network. That’s way too high for a safe driving experience!
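For readers unfamiliar with triple-voting, here is a minimal sketch of a bitwise majority voter over three redundant copies of a value. It is a generic textbook construction, not something taken from the paper, and `majority_vote` is a hypothetical name.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


correct = 0b1010
upset = correct ^ 0b1000          # one copy suffers a single bit flip
voted = majority_vote(correct, correct, upset)
print(bin(voted))
```

A single corrupted copy is outvoted, so one bit flip in one replica does not reach the output. Two replicas flipping the same bit would still defeat the voter.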

Raúl’s view

The main contribution of SiFI-AI is simulating transient faults in DNN accelerators by combining fast AI inference with cycle-accurate RTL simulation and condition-based fault injection. This is 7x faster than the state of the art (reference 2, Condia et al., Combining Architectural Simulation and Software Fault Injection for a Fast and Accurate CNNs Reliability Evaluation on GPUs). The trick is to simulate only what is necessary in slow cycle-accurate RTL. The faults modeled are single-event upsets (SEUs), i.e., transient bit-flips induced by external effects such as radiation and charged particles, which persist until the next write operation. Finding out whether a single fault will cause an error is especially difficult in this case; the high degree of data reuse could lead to significant fault propagation, and fault simulation needs to take both the hardware architecture and the DNN model topology into account.
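The SEU model described above (a bit flips and stays flipped until the register is next written) can be sketched in a few lines. The 8-bit signed width, the `flip_bit` helper and the `Register` class are assumptions for illustration; the post does not state the register widths used in the paper.

```python
def flip_bit(value: int, bit: int, width: int = 8) -> int:
    """Flip one bit of a signed two's-complement integer of the given width."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    raw = (value & mask) ^ (1 << bit)
    # Reinterpret the raw bit pattern as a signed value.
    return raw - (1 << width) if raw >> (width - 1) else raw


class Register:
    """Toy register: an upset corrupts the stored value until the next write."""

    def __init__(self, value: int = 0, width: int = 8) -> None:
        self.value = value
        self.width = width

    def write(self, value: int) -> None:
        self.value = value  # a fresh write clears any earlier upset

    def upset(self, bit: int) -> None:
        self.value = flip_bit(self.value, bit, self.width)


reg = Register(5)
reg.upset(7)            # flip the sign bit
corrupted = reg.value   # every read sees the corrupted value...
reg.write(5)            # ...until the register is rewritten
print(corrupted, reg.value)
```

Flipping the most significant bit turns 5 into -123 here, which hints at why the position of the flipped bit matters as much as which register is hit.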

SiFI-AI integrates the hardware simulation into the ML framework (PyTorch). For HW simulation it uses Verilator, a free and open-source Verilog simulator, to generate cycle-accurate RTL models. A fault controller manages fault injection as directed by the user, using a condition-based approach, i.e., a list of conditions that prevent a fault from being masked. To select what part is simulated in RTL, it decomposes layers into smaller tiles based on “the layer properties, loop tiling strategy, accelerator layout, and the respective fault” and selects a tile.
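One simple example of such a condition is "the register is read again before it is overwritten". The sketch below finds the cycles where an injected flip would satisfy that condition. It is only an illustration of masking: the access-trace encoding and the `unmasked_cycles` helper are hypothetical, and the actual conditions used by SiFI-AI are not detailed in this post.

```python
from typing import List, Sequence


def unmasked_cycles(accesses: Sequence[str]) -> List[int]:
    """Cycles after which a bit flip would be read before being overwritten.

    accesses[t] is 'r' (read), 'w' (write) or '-' (idle) for one register in
    cycle t. The flip is assumed to happen just after the access in cycle t.
    """
    result: List[int] = []
    next_is_read = False  # is the first non-idle access after cycle t a read?
    for t in range(len(accesses) - 1, -1, -1):
        if next_is_read:
            result.append(t)
        if accesses[t] == "r":
            next_is_read = True
        elif accesses[t] == "w":
            next_is_read = False
    result.reverse()
    return result


trace = ["w", "-", "r", "w", "-", "w", "r"]
print(unmasked_cycles(trace))
```

Injecting only at cycles returned by a filter like this avoids spending simulation time on faults that are guaranteed to be overwritten before anyone observes them.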

The device tested in the experimental part is Gemmini, a systolic array DNN accelerator created at UC Berkeley in the Chipyard project, in a configuration of 16×16 processing elements (PE). SiFI-AI performs a resilience study with 1.5 M fault injection experiments on two typical DNN workloads, ResNet-18 and GoogLeNet. Faults are injected into three PE data registers and one control signal, as specified by the user. Results show a low error probability, confirming the resilience of DNNs. They also show that control signal faults have much more impact than data signal faults, and that wide and shallow layers are more susceptible than narrow and deep layers.
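To give a feel for the size of the fault space, here is a sketch that samples random fault sites over a 16×16 array with the four targets named in the paper. The `FaultSite` fields, the `sample_fault_site` and `sites_per_layer` helpers and the `num_layers` parameter are hypothetical, and the sketch leaves out the bit position and injection cycle that a real campaign would also need to choose.

```python
import random
from dataclasses import dataclass

ROWS = COLS = 16  # processing elements in the array
TARGETS = ("input_activation", "weight", "partial_sum", "weight_select")


@dataclass(frozen=True)
class FaultSite:
    layer: int   # which network layer is simulated in RTL
    row: int     # PE row in the array
    col: int     # PE column in the array
    target: str  # which register or control signal is hit


def sites_per_layer() -> int:
    """Number of distinct (PE, target) combinations for one layer."""
    return ROWS * COLS * len(TARGETS)


def sample_fault_site(rng: random.Random, num_layers: int) -> FaultSite:
    """Draw one fault site uniformly at random (assumed sampling scheme)."""
    return FaultSite(
        layer=rng.randrange(num_layers),
        row=rng.randrange(ROWS),
        col=rng.randrange(COLS),
        target=rng.choice(TARGETS),
    )


rng = random.Random(0)  # seeded so a campaign can be reproduced
site = sample_fault_site(rng, num_layers=10)
print(sites_per_layer(), site)
```

Even before bit positions and timing are considered, each layer has 1,024 (PE, target) combinations, so a uniform random campaign needs many experiments to cover the space evenly.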

This is a good paper which advances the field of DNN reliability evaluation. The paper is well-written and clear and provides sufficient details and references to support the claims and results. Even though the core idea of combining simulation at different levels is old, the authors use it very effectively. Frameworks like SiFI-AI can help designers and researchers optimize their architectures and make them more resilient. I also like the analysis of the fault impact on different layers and signals, which reveals some interesting insights. The paper could be improved by providing more information on the fault injection strategy and the selection of the tiles. Despite the topic being quite specific, overall, a very enjoyable paper!
