Artificially stalling datapaths and virtual channels is a creative method to uncover corner case timing bugs. A paper from Nvidia describes a refinement to this technique. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
This month’s pick is Deep Stalling using a Coverage Driven Genetic Algorithm Framework. The paper was published at the 2021 IEEE VLSI Test Symposium. The authors are from Nvidia.
Congestion is the condition most likely to expose problems such as deadlocks, ordering rule violations and credit underflow/overflow. Finding these problems requires a method to randomly stall FIFOs and pipelines to tease out timing corner cases; stalls create backpressure, which is most likely to trigger such problems.
Verifiers can select which FIFOs to stall, and when. Given the many FIFOs in a design, randomization is the default method to (weakly) optimize coverage, though it raises the concern that it may miss many potential problems. The authors show how they apply genetic algorithms to learn how to improve coverage, using FIFO fill and RAM occupancy statistics as coverage metrics.
This is a tight paper, easy to read, and provides a clear contribution. It tackles a very specific but important problem: how to most efficiently cover FIFO stalls in functional verification. The authors share that their GPU testbenches have special code to artificially force a FIFO to report full. This code is controlled by some randomized parameters: the probability of triggering the force, and the time window over which the force is held. There is a set of such parameters for each FIFO in the design.
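As a rough sketch of how such per-FIFO stall parameters might behave, the fragment below models a forced-full signal driven by a trigger probability and a hold window. All names and the cycle-by-cycle mechanics are my assumptions, not the authors' actual testbench code.

```python
import random

# Hypothetical per-FIFO stall injector: with probability `force_prob` per
# cycle, force the FIFO to report full and hold that force for `window`
# cycles. A genome for the GA would be the flattened (prob, window) pairs
# across all FIFOs. Names are illustrative, not from the paper.
class FifoStallInjector:
    def __init__(self, force_prob, window):
        self.force_prob = force_prob   # probability of triggering a forced full
        self.window = window           # cycles to hold the forced full
        self._remaining = 0            # cycles left in the current stall

    def forced_full(self):
        """Call once per simulated cycle; True while an artificial stall is active."""
        if self._remaining > 0:
            self._remaining -= 1
            return True
        if random.random() < self.force_prob:
            self._remaining = self.window - 1
            return True
        return False

# One injector per FIFO in the design (hypothetical FIFO names).
injectors = {"fifo_a": FifoStallInjector(0.01, 8),
             "fifo_b": FifoStallInjector(0.05, 4)}
```

A testbench would consult `forced_full()` each cycle and OR it into the FIFO's real full flag, so the transmitter sees backpressure even when the FIFO has space.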
They use a genetic algorithm to select the best values of these parameters to maximize coverage. Each iteration of the genetic algorithm requires re-running all tests, which limits both initial population size and evolution cycles for the algorithm. To get around this limit, they train a neural network to predict functional coverage based on parameter settings and use this neural network for natural selection in their genetic algorithm rather than actually re-running tests. Using this neural network they are able to achieve a 60x increase in genetic algorithm capacity.
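One surrogate-assisted GA generation along these lines might look like the sketch below, using the operators the paper names (elitism, roulette-wheel selection, standard crossover) with a toy fitness function standing in for the trained coverage predictor. All names, operator details and the toy fitness are illustrative assumptions, not the authors' implementation.

```python
import random

def roulette_pick(population, fitness):
    """Roulette-wheel selection: pick a genome with probability proportional to fitness."""
    weights = [fitness(g) for g in population]
    return random.choices(population, weights=weights, k=1)[0]

def crossover(a, b):
    """Standard single-point crossover of two stall-parameter vectors."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def next_generation(population, fitness, elite=2):
    """Elitism keeps the best genomes unchanged; the rest come from selection + crossover."""
    ranked = sorted(population, key=fitness, reverse=True)
    children = ranked[:elite]                       # elitism
    while len(children) < len(population):
        p1 = roulette_pick(population, fitness)
        p2 = roulette_pick(population, fitness)
        children.append(crossover(p1, p2))
    return children

# Toy surrogate: pretend predicted coverage rises with mean stall probability.
# In the paper this role is played by the trained neural network, making each
# generation cheap because no tests are re-run.
predict = lambda genome: sum(genome) / len(genome) + 1e-9  # keep weights positive

pop = [[random.random() for _ in range(6)] for _ in range(20)]
best_before = max(map(predict, pop))
for _ in range(10):                                 # 10 cheap surrogate generations
    pop = next_generation(pop, predict)
```

Because selection runs against the predictor rather than simulation, population size and generation count can grow substantially, which is the source of the 60x capacity increase the authors report.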
Results are solid – on a key system level coverage metric for the number of stalled GPU shader threads, the authors’ method can push up to 126 out of a theoretical maximum of 128, versus a baseline of only 90 without the genetic algorithm and neural network. Using the genetic algorithm alone, without the neural network, achieves 113.
In their introduction the authors note that there is no parameter in their testbench to directly control simultaneous stalling of multiple FIFOs. I can’t help but feel that such a parameter could be very effective to build up “backpressure” from multiple FIFOs being stalled and drive up corner case coverage. Identifying the appropriate groups of FIFOs to stall simultaneously to achieve the necessary backpressure could be well suited to the genetic algorithm proposed by the authors.
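Purely as an illustration of that suggestion, a grouping parameter could be encoded as a bitmask genome over the design's FIFOs, letting the genetic algorithm evolve which groups to stall together. The encoding and FIFO names here are my invention, not something the paper proposes concretely.

```python
# Hypothetical encoding of a simultaneous-stall parameter: bit i of the
# genome means FIFO i joins a grouped stall. The GA could then search for
# the groups that best build up backpressure. FIFO names are invented.
fifos = ["fifo_a", "fifo_b", "fifo_c", "fifo_d"]

def stall_group(mask):
    """Return the FIFOs a genome selects for simultaneous stalling."""
    return [f for i, f in enumerate(fifos) if (mask >> i) & 1]

group = stall_group(0b0101)  # selects fifo_a and fifo_c together
```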
When a FIFO is full it stalls; it backpressures the transmitter trying to send a value to it. In well-designed systems this rarely occurs. To be able to simulate what happens if FIFOs fill, Nvidia inserts artificial stalls to generate backpressure. They generate stall lengths in Monte Carlo simulation to meet a given coverage goal.
The authors accelerate this process in two ways:
- Using a Genetic Algorithm framework that evolves stall parameters using elitism, Roulette Wheel Selection and Standard Crossover. This boosts coverage, normalized by simulation time, by 163% for a design called UnitA and by 88% for UnitB. Looking at individual coverage objectives, most of them get a boost; one example showed 472%, although 4 out of 11 get no boost. In one case there was a slight drop because the multi-objective evolutionary algorithm was trading this objective to maximize others.
- A Deep Learning model learns the relationship between the stalling parameters and the coverage metric. The DL model used is a “5-layered MultiLayer Perceptron (MLP) with batch normalization, dropout and RELU applied to all hidden layers. It uses a sigmoid activation function for the final layer.” The authors ran 30,000 tests to generate data to train and validate the model. The model predicts coverage values for the test data with an accuracy of 85% for the top 1,000 sorted tests and 57% across all 10,000 tests.
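A minimal inference-only sketch of an MLP in that shape is below. The layer widths are illustrative assumptions, and batch normalization and dropout (which matter during training) are omitted for brevity; this is not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(sizes):
    """Random weights and zero biases for a stack of dense layers."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def predict_coverage(params, x):
    """Forward pass: ReLU on hidden layers, sigmoid on the final layer,
    squashing predicted coverage into [0, 1]."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)             # ReLU hidden layer
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))      # sigmoid output layer

# 5 weight layers: 8 stall parameters in, 4 hidden layers of 64, 1 coverage
# score out. Widths are assumed for the sketch, not taken from the paper.
params = mlp_init([8, 64, 64, 64, 64, 1])
stall_params = rng.random((1000, 8))               # 1000 candidate genomes
scores = predict_coverage(params, stall_params)    # predicted coverage per genome
```

In the paper's flow a trained model like this replaces test re-runs during natural selection; only the highest-scoring genomes go on to real simulation.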
The authors conclude that they could intelligently tune stall parameters to significantly boost the coverage metric over purely random stalling. This seems reasonable for the GA part, although it may miss cases covered by a purely random approach since it favors certain parameter values. The DL model is an intriguing addition which presumably needs further development to rise above 57% accuracy.
It is a well written paper and easy to follow. It focuses entirely on the application and simply states which genetic algorithm and deep learning model are used. I think this will appeal to designers and EDA tool builders who have added deep learning and genetic algorithms to their portfolios.
In reading around this topic, I noticed multiple articles on backpressure routing in NoCs. This method may have value there also.