A checker tripped in verification. Is there a bug trace minimization technique to simplify manual debug? Paul Cunningham (GM, Verification at Cadence), Jim Hogan and I continue our series to highlight all the great research that’s out there in verification. Feel free to comment.
This month’s pick is Simulation-Based Bug Trace Minimization With BMC-Based Refinement. We found this paper in IEEE Transactions on CAD, 2005. The authors are/were from the University of Michigan. This is an old paper but still intriguing.
Debug, tracing back from an identified bug to the root cause, is the biggest time-sink in verification. Any contribution to reducing that time will have high value. The authors’ approach starts with a waveform trace from a simulation or semi-formal analysis. It aims to reduce the trace to a much shorter trace that still triggers the bug, an easier starting point for manual debug.
The paper describes four simulation-based and one BMC-based technique to reduce traces. They first reduce traces by removing cycles, re-simulating each time to check the bug still reproduces. At first glance this looks very unscalable, requiring O(N2) runs for a trace of N cycles. They greatly reduce this complexity first by hashing the circuit state at each cycle. They then watch for hash matches between the original trace and a re-simulation of a candidate reduced trace. If a previously hashed state is hit during re-simulation, they know that the bug can be reached from that hashed state. They can abort the simulation since it can still trigger the bug, i.e. the reduction is proven viable.
Through this process they look for any variant trace which triggers the check sooner, which becomes a new and shorter reference trace. Alternatively it may hit a state already seen in an earlier analysis at a later clock cycle. Then this trace can skip ahead, also leading to a shorter reference trace. In a final high-effort simulation step, they also look for opportunities to drop input events (rather than whole cycles), as shown in table 3.
Using common datapath functions, FPU, DES and a picoJava engine as benchmarks, the authors show impressive reductions in cycle lengths, better than 98% in most cases and better than 99% in all cases in removing unnecessary inputs. Runtime on most tests was under a minute. The most complex (picoJava) was 10hrs for 30k cycles. Reduced traces were mostly under 10 cycles.
This reminds me of an earlier paper we discussed “Using AI to locate a fault”. Both combine multiple methods in fault tracing, getting more out of the combination than out of any one method alone. This approach combines five methods, four simulation-based and one BMC-based. The results, e.g. in figure 12, clearly show how different techniques have different impact on different testcases. Which underlines that you really need all these methods. For a commercial vendor this looks very practical, a fusion of methods rather than a single super-method. Tables 5 and 6 show the bulk of reduction coming from simulation methods and a smaller incremental benefit from BMC. Also encouraging for commercial mass deployments given scalability considerations for model-checking.
Intuitively the method makes a lot of sense. State hashing and looking for matches will almost certainly be very effective on randomized simulations. It’s the classic computer science random walk problem where the drunken man walking randomly is going to do a lot of circling relative to the amount of actual useful distance moved. All these circles quickly prune away by looking for state hash matches, which should massively reduce practical runtimes.
In their experiments picoJava (around 140k gates) ran in 10 hours. That was running a 1GHz Sun blade (2005 remember). Now 3GHz servers are typical, so you’re looking at 500K gates in 10 hours for a single CPU job on a modern server farm. The algorithm is also very parallelizable, so can be scaled up by just farming re-simulation jobs out to multiple servers, sharing the same state hash database. Which makes it very commercially interesting.
Moving this into the cloud seems a great way to speed up and reach for larger designs. Thinking about it, 10 hours is an overnight run. You could slipstream these runs in behind regression runs. By the time the verification team had a chance to look at them, most or all of those traces would already be reduced.
On investment, this naturally fits into the verification suite, so it’s not an independent product. A bit of a challenge is that this is speeding up something engineers already do rather than making something possible that wasn’t possible before. Can it in some way show a huge improvement in productivity? Maybe add an AI angle for 100X improvement over time? That could enhance appeal for an investor.
First, while a lot of good ideas come from software verification, it’s nice to see some coming from hardware verification. Second, I had been looking mostly for recent papers. Paul pushed me to look at some older papers as well. Good intuition!
Click HERE to see the previous Innovation blog