Instrumenting post-silicon validation is not a new idea, but here’s a twist: using (pre-silicon) emulation to choose which debug observation structures to instrument in silicon. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO) and I continue our series on research ideas. As always, feedback welcome.
This month’s pick is Emulation Infrastructure for the Evaluation of Hardware Assertions for Post-Silicon Validation. The paper was published in the 2017 IEEE Transactions on VLSI. The authors are from McMaster University, Hamilton, ON, Canada.
The authors distinguish between logical and electrical errors post-silicon and devote their attention in this paper to electrical errors, detectable through bit-flips in flops. Their approach is to determine an optimal set of assertions in pre-silicon analysis. These they then implement in silicon in support of post-silicon debug. The pre-silicon analysis is similar to faulting in safety analyses, injecting faults on flops corresponding to electrical errors, as they hint in the paper. They generate a candidate list of assertions using assertion synthesis; the core of their innovation is to provide a method to grade these assertions by how effective each is in detecting multiple faults.
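To make the fault-injection idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy design (a 4-bit one-hot ring counter), the assertion (`one_hot`) and the helper `detects` are hypothetical and do not come from the paper, which works on real netlists with mined assertions.

```python
def step(state):
    """Advance a one-hot ring counter by one clock (toy design, not from the paper)."""
    return state[-1:] + state[:-1]


def one_hot(state):
    """Toy assertion: exactly one flop holds a 1."""
    return sum(state) == 1


def detects(flop, inject_cycle, window, n_flops=4):
    """Inject one transient bit flip and report whether the assertion fires.

    flop         -- index of the flop whose value is flipped
    inject_cycle -- number of fault-free cycles before the injection
    window       -- number of cycles the assertion is given to fire
    """
    state = [1] + [0] * (n_flops - 1)
    for _ in range(inject_cycle):
        state = step(state)
    state[flop] ^= 1  # the injected electrical error, modeled as a bit flip
    for _ in range(window):
        if not one_hot(state):
            return True
        state = step(state)
    return False
```

In this toy case the assertion catches every flip as soon as it is checked, for example `detects(2, 5, 3)` returns `True`. Assertions mined from a real design will generally miss some flips or need several cycles, which is why grading them matters.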
Input generation is random, analyzing injected faults (treated as transient) in sequence. They allow a user-specified number of cycles for detection per fault. In a subsequent phase, they measure effectiveness using two different coverage techniques. For flip-flop coverage, they count an assertion if it catches an injected error on any flop. In bit-flip coverage, they score assertions by the number of errors detected on separate flops. They use these metrics, together with area estimates, (alternately) to select the preferred assertions.
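One plausible reading of the two metrics, sketched in Python. The data layout is an assumption made for illustration: a fault is a hypothetical `(flop, injection_index)` pair and `detections` maps each assertion name to the set of faults it caught. The paper's exact definitions may differ in detail.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: a fault is (flop name, injection index).
Fault = Tuple[str, int]


def flip_flop_coverage(detections: Dict[str, Set[Fault]], flops: Set[str]) -> float:
    """Fraction of flops with at least one injected flip caught by some assertion."""
    covered = {flop for faults in detections.values() for flop, _ in faults}
    return len(covered & flops) / len(flops)


def bit_flip_coverage(detections: Dict[str, Set[Fault]], injected: Set[Fault]) -> float:
    """Fraction of all injected flips caught by some assertion."""
    detected = set().union(*detections.values())
    return len(detected & injected) / len(injected)


flops = {"f0", "f1", "f2", "f3"}
injected = {(f, i) for f in flops for i in range(2)}  # 2 flips per flop
detections = {
    "a1": {("f0", 0), ("f0", 1), ("f1", 0)},
    "a2": {("f1", 0), ("f2", 1)},
}
ff_cov = flip_flop_coverage(detections, flops)      # 3 of 4 flops -> 0.75
bf_cov = bit_flip_coverage(detections, injected)    # 4 of 8 flips -> 0.5
```

The example shows why the two numbers diverge: one caught flip is enough to count a flop as covered, while bit-flip coverage requires catching each individual injection.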
This paper pairs nicely with our August 2020 blog on quick error detection (QED). QED accelerates post-silicon functional bug detection, where this blog focuses on post-silicon electrical bug detection. The paper is an easy read, although it helps to first read reference .
Electrical bugs are hard to catch and, even when caught, hard to replicate and to trace to the underlying physical cause. The authors propose a method, through embedded logic, to detect when such bugs cause a flop to flip to an incorrect value (they don’t dig deeper than finding these flips).
The heart of the paper and its companion reference is a multi-step method to create and synthesize this detection logic. It begins with mining properties of the design as temporal assertions using the GoldMine tool. They rank assertions based on an estimate of their ability to detect bit flips and an estimate of the area / wiring cost to implement them in silicon. Ranking relies on running many pre-silicon simulations with candidate assertions, injecting bit-flip errors and counting the flips each assertion detects. In the original paper they used logic simulation; here they accelerate these simulations by mapping the design to an Altera FPGA board.
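As a rough illustration of trading detection ability against area, here is a generic greedy heuristic in Python. This is not the paper's ranking function: the scoring rule (newly detected flips per unit of area), the function name `select_assertions` and the data layout are all assumptions chosen for the sketch.

```python
def select_assertions(detections, area, budget):
    """Greedy pick: best 'newly detected flips per unit area' that still fits.

    detections -- dict: assertion name -> set of detected faults (hypothetical layout)
    area       -- dict: assertion name -> estimated area cost, must be > 0
    budget     -- total area allowed for the embedded assertion logic
    """
    chosen, covered, used = [], set(), 0.0
    remaining = set(detections)
    while remaining:
        best, best_score = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if used + area[name] > budget:
                continue
            score = len(detections[name] - covered) / area[name]
            if score > best_score:
                best, best_score = name, score
        if best is None:  # nothing fits, or nothing adds new coverage
            break
        chosen.append(best)
        covered |= detections[best]
        used += area[best]
        remaining.discard(best)
    return chosen, covered


detections = {"a1": {1, 2, 3}, "a2": {3, 4}, "a3": {5}}
area = {"a1": 3.0, "a2": 1.0, "a3": 2.0}
chosen, covered = select_assertions(detections, area, budget=4.0)
# chosen == ["a2", "a1"]; "a3" no longer fits in the remaining budget
```

Scoring only the flips not already covered keeps the selection from paying area for assertions that overlap with ones already chosen.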
I like how they pull together several innovations into a coherent method for post-silicon bit flip detection: assertion mining, assertion synthesis, and an elegant ranking function for assertion selection. However, the results section of the paper indicates that detecting bit flips in n% of the flip-flops requires roughly an n% increase in design area. This seems challenging for commercial application, especially since it only helps find electrical bugs. One could potentially achieve a similar result by cloning the logic cone driving a flip-flop, then comparing the output of this cloned logic to the original logic. This would seem to generate a similar area overhead to their method, in the limit cloning the entire design (i.e. 100% area overhead) to detect flips in 100% of the flops in the design.
The paper is self-contained with a fair amount of detail. The authors ran experiments on 3 ISCAS sequential circuits (approx. 12K gates, 2000 FF). Preparation experiments inject 256 errors per flip-flop and use all assertions generated by GoldMine. Due to the limited capacity of the FPGA, the authors split runs into 45 “sessions” for one circuit. The results show, even with 45 sessions, an acceleration in analysis over simulation of 20-500 times (only up to 8 error injections because simulation gets too slow, 105h). The maximum achievable flip-flop coverage is 55%, 89% and 99% for the 3 circuits. The number of assertions mined controls coverage.
Running with selected assertions (corresponding to a 5-50% area overhead) and 1-256 injections results in 2.2%-34% bit coverage. Most of the time, the assertion miner ran for 228h. One thing that confused me is their data for run-times versus errors injected. The increase looks reasonable (linear) in simulation. But in emulation it jumps massively, from 0.045h to 5.4h, for an increase from 2 to 8 error injections. I’d like more explanation on this point.
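A quick back-of-the-envelope check of that emulation jump, in Python, using only the numbers quoted above. The linear extrapolation is my assumption about what "reasonable" scaling would look like, not a figure from the paper.

```python
# Emulation run-times quoted above, in hours.
t_2_injections = 0.045
t_8_injections = 5.4

injection_ratio = 8 / 2                               # 4x more injections
runtime_ratio = t_8_injections / t_2_injections       # about 120x more run-time
linear_estimate = t_2_injections * injection_ratio    # about 0.18h if scaling were linear
excess = t_8_injections / linear_estimate             # about 30x above linear
```

So a 4x increase in injections costs about 120x in run-time, roughly 30x more than linear scaling would predict.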
This is a methodology paper. I like that pretty much every step can be substituted by a commercial tool. Together with using a large FPGA board (as emulator) the methodology scales. Methodologies are of course very hard to commercialize, but it’s a nice application for existing technology!
Applying a safety analysis technique to post-silicon debug is intriguing. It is a novel idea, even though it leads to a somewhat impractical result for commercial application.