Fault Sim on Multi-Core Arm Platform in China. Innovation in Verification
by Bernard Murphy on 04-24-2024 at 6:00 am

How much can running on a multi-core (Arm) CPU speed up fault simulation? Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.


The Innovation

This month’s pick is Fault Simulation Acceleration Based on ARM Multi-core CPU Architecture, published at the 2023 IEEE Asian Test Symposium. The authors are from HiSilicon and Huawei.

This paper improves fault simulation throughput by exploiting parallelism on a multi-core CPU. Curiously, there is no mention of safety applications in this paper or in a recent reference they cite, suggesting an enduring interest in China in fault sim for regular test grading, here, I would imagine, for communications systems. The authors mention GPU-based and distributed compute as acceleration alternatives but note that these suffer from multiple drawbacks. In contrast, they claim their proposed solution using 128 cores is much easier to program and offers meaningful acceleration.

Paul’s view

Verification of test patterns is an N x M-style problem, where N is the number of patterns and M is the number of possible faults (stuck-at-1, stuck-at-0, …). Each pattern-fault pair can be simulated in parallel, but for commercial-scale designs N x M is in the billions, so there is still a massive amount of serialization of sims even if thousands of CPU cores can be allocated.
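To put rough numbers on this, here is a minimal sketch in C of the pattern-fault work space and the serial share left to each core. The sizes are illustrative assumptions, not figures from the paper.

    /* Illustrative only: the pattern-fault pair space and the serial share
       per core. The sizes below are assumptions for scale, not data from
       the paper. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t num_patterns = 10000;     /* N: test patterns (assumed) */
        uint64_t num_faults   = 1000000;   /* M: stuck-at fault sites (assumed) */
        uint64_t cores        = 128;       /* e.g. one Kunpeng 920 server */

        uint64_t pairs = num_patterns * num_faults;   /* N x M = 10^10 here */
        printf("pattern-fault pairs: %llu\n", (unsigned long long)pairs);
        printf("sims per core (ideal split): %llu\n",
               (unsigned long long)(pairs / cores));
        return 0;
    }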

This paper shares three insights into parallelizing pattern-fault sims on modern high core-count Arm servers to maximize throughput. Results are presented on a 128-core Huawei Kunpeng 920 server.

The first insight relates to vectorizing faults, what commercial EDA tools call “concurrent fault simulation”. A 64-bit word can be used to represent the value on a wire across 64 different fault simulations. The authors observe that the SIMD capabilities in the Arm NEON unit can be used to increase the number of concurrent fault sims per core from 64 to 128. This gives a ~1.6x speed-up.
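As a concrete illustration (a minimal sketch, not the authors' code), bit-parallel gate evaluation packs one faulty circuit copy per bit lane, so a 64-bit word evaluates 64 fault machines at once and a 128-bit NEON register doubles that to 128.

    /* Minimal sketch, not the authors' code: bit-parallel evaluation of one
       AND gate, where each bit lane carries the same net value in a different
       faulty copy of the circuit. Requires an AArch64 / NEON-capable compiler. */
    #include <stdint.h>
    #include <arm_neon.h>

    /* 64 concurrent fault copies per 64-bit word. */
    static inline uint64_t and_gate_64(uint64_t a, uint64_t b) {
        return a & b;
    }

    /* 128 concurrent fault copies per NEON register (two 64-bit lanes). */
    static inline uint64x2_t and_gate_128(uint64x2_t a, uint64x2_t b) {
        return vandq_u64(a, b);
    }

    /* Injecting a stuck-at-1 fault in lane i: force that bit to 1 after the
       gate is evaluated (illustrative; real tools track fault sites per net). */
    static inline uint64_t inject_sa1_64(uint64_t value, unsigned lane) {
        return value | (1ULL << lane);
    }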

The second insight relates to assigning pattern-fault pairs to cores. The authors observe that it is better to parallelize patterns across cores rather than faults across cores. This gives an impressive 2.2x speed-up.
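A sketch of what pattern-level parallelism might look like, assuming hypothetical simulator kernels simulate_good() and simulate_faulty_packed() and using OpenMP for the thread pool (the paper does not say which threading framework the authors use):

    /* Sketch only: patterns are distributed across cores; each thread runs the
       good-machine sim for its pattern, then sweeps the packed fault words.
       simulate_good() and simulate_faulty_packed() are hypothetical stand-ins. */
    #include <omp.h>

    void simulate_good(int pattern);                     /* hypothetical kernel */
    void simulate_faulty_packed(int pattern, int word);  /* hypothetical kernel */

    void run_pattern_parallel(int num_patterns, int num_fault_words) {
        #pragma omp parallel for schedule(dynamic)
        for (int p = 0; p < num_patterns; p++) {
            simulate_good(p);                        /* fault-free reference */
            for (int fw = 0; fw < num_fault_words; fw++) {
                simulate_faulty_packed(p, fw);       /* 64 or 128 faults per word */
            }
        }
    }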

Lastly, the authors observe that the 128 cores are split across 4 dies, each die with a direct link to its “local” DRAM. Any core on any die can access DRAM attached to any die, but local DRAM access is 3-4x faster. By replicating the design data (which is constant and shared across all sims) in the local DRAM of each die they get a 1.2x speed-up.
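Conceptually, the replication could look like the following libnuma sketch. This is my illustration rather than the paper's code, and it assumes Linux with libnuma available: each die gets its own copy of the read-only design data in local DRAM, and a worker pinned to that die reads the nearest copy.

    /* Sketch only (assumes libnuma, link with -lnuma): replicate the read-only
       design/netlist data into each NUMA node's local DRAM, so a worker thread
       pinned to a die reads the lowest-latency copy. */
    #include <numa.h>
    #include <stdlib.h>
    #include <string.h>

    void **replicate_design(const void *design, size_t bytes) {
        if (numa_available() < 0) return NULL;     /* no NUMA support */
        int nodes = numa_num_configured_nodes();   /* e.g. 4 dies on Kunpeng 920 */
        void **copies = malloc((size_t)nodes * sizeof(void *));
        for (int node = 0; node < nodes; node++) {
            copies[node] = numa_alloc_onnode(bytes, node);   /* node-local DRAM */
            if (copies[node]) memcpy(copies[node], design, bytes);
        }
        return copies;
    }

    /* A worker later reads the copy local to the node it runs on, e.g.
       copies[numa_node_of_cpu(sched_getcpu())]. */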

Overall, a tight paper with clear insights and real benefits directly applicable to commercial EDA use today. Nice.

Raúl’s view

Fault simulation can be accelerated by simulating faults or test patterns in parallel; faults are independent of each other, as are test patterns. This paper evaluates simulating faults and test patterns in parallel on a specific non-uniform memory access (NUMA) architecture, the Kunpeng 920. It consists of two CPUs with two NUMA nodes each, each node having 32 ARM cores. The local node memories have varying access times depending on which node is accessing them.

The paper explains the methods used to accelerate simulation: as usual, many bits are packed into a word, and using a particular data type in the ARM NEON architecture, 64 or 128 patterns can be simulated simultaneously; execution threads are bound to cores and memory based on memory access delay (binding optimization); the simulated netlist is replicated to reduce cross-node memory access; and fault data is segmented and allocated across the four node memories.
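The binding optimization is, in spirit, the standard Linux core and memory affinity pattern. A hedged sketch of what each worker thread might do at startup; the calls are standard Linux/libnuma, but the policy shown is illustrative rather than the authors' exact scheme:

    /* Sketch only: bind the calling worker thread to one core and prefer that
       core's NUMA node for its memory allocations ("binding optimization").
       Link with -lpthread -lnuma; illustrative policy, not the paper's code. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <numa.h>

    void bind_self_to_core(int cpu) {
        /* Pin this thread to a single core. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Prefer allocations from the NUMA node that owns this core. */
        if (numa_available() >= 0)
            numa_set_preferred(numa_node_of_cpu(cpu));
    }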

The experimental results for 5 circuits (ITC99 and IWLS2005 benchmarks plus industrial circuits, presumably their own) show that, as expected, pattern parallelism is faster than fault parallelism by a factor of 1.11-3.74, around 2x on average. This is because in pattern parallelism fault simulations can start right after each pattern is simulated correctly (on the faultless design), while in fault parallelism all patterns must first be simulated correctly. However, pattern parallelism consumes more memory. Other reported results: parallel simulation of 128 patterns is about 1.6 times quicker than 64; binding optimization gives a 1.06x to 1.29x speedup; and cross-node memory access optimization gives a 1.13x to 1.52x speedup.

The paper does not review or compare the state of the art, does not contextualize the work, and makes unsupported claims such as “Compared with the previous technical scheme, the ARM multi-core CPU used in this paper has the advantages of low cost and low energy consumption…”. Nevertheless, I found the paper a valuable report that contains many implementation details and offers helpful insight into how to speed up fault simulation.
