Assertion-based verification only catches problems for which you have written assertions. Is there a complementary approach to find problems you haven’t considered – the unknown unknowns? Paul Cunningham (Senior VP/GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
This month’s pick is Machine Learning-based Anomaly Detection for Post-silicon Bug Diagnosis. The paper was published at the 2013 DATE conference. The authors are/were from the University of Michigan.
Anomaly detection methods are popular where you can’t pre-characterize what you are looking for – in credit card fraud, for example, or in real-time security, where hacks continue to evolve. The method gathers behaviors over a trial period, manually screened to be within expected behavior, then looks for outliers in ongoing testing as potential problems for closer review.
Anomaly detection techniques use either statistical analysis or machine learning. This paper uses machine learning to build a model of expected behavior. You could also easily imagine this analysis being shifted left into pre-silicon verification.
This month we’ve pulled a paper from 10 years ago on using machine learning to automatically root-cause bugs in post-silicon validation. It’s a fun read and looks like a great fit for revisiting now using DNNs or LLMs.
The authors equate root-causing post-silicon bugs to credit card fraud detection: every signal traced in every clock cycle can be thought of as a credit card transaction, and the problem of root causing a bug becomes analogous to identifying a fraudulent credit card transaction.
The authors’ approach goes as follows: divide simulations into time slices and track the percentage of time each post-silicon traced debug signal is high in each time slice. Then partition the signals based on the module hierarchy, aiming for a module size of around 500 signals. For each module in each time slice, train a model of the “expected” distribution of signal %high times using a golden set of bug-free post-silicon traces. This model is a very simple k-means clustering of the signals, using the difference in %high times as the “distance” between two signals.
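To make the training step concrete, here is a minimal sketch (not the authors' code) of clustering one module's signals by their %high values with a tiny 1-D k-means. The helper names and the example trace values are hypothetical.

```python
import random

def percent_high(trace):
    """Fraction of cycles a signal's value is 1 within one time slice."""
    return sum(trace) / len(trace)

def kmeans_1d(values, k, iters=50, seed=0):
    """Tiny 1-D k-means; 'distance' is the difference in %high times."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each signal's %high value to its nearest center.
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Recompute centers; keep the old center if a cluster went empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical golden %high values for six signals in one time slice.
golden_phigh = [0.90, 0.85, 0.88, 0.05, 0.10, 0.08]
centers, clusters = kmeans_1d(golden_phigh, k=2)
```

With well-separated golden data like this, the two centers settle on the means of the "mostly high" and "mostly low" signal groups.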
For each failing post-silicon test, the %high signal distribution for each module in each time slice is compared to the golden model, and the number of signals whose %high time falls outside the bounding box of its golden model cluster is counted. If this number exceeds a noise threshold, those signals in that time slice are flagged as the root cause of the failure.
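The detection step might be sketched as below, under my reading of the paper: each golden cluster contributes a [min, max] bounding box, a failing signal is compared against its nearest box, and the slice is only flagged when the outlier count clears the noise threshold. Signal names and values are hypothetical.

```python
def cluster_boxes(clusters):
    """[min, max] %high bounding box for each non-empty golden cluster."""
    return [(min(c), max(c)) for c in clusters if c]

def nearest_box(value, boxes):
    """Golden box whose midpoint is nearest to this signal's %high."""
    return min(boxes, key=lambda b: abs(value - (b[0] + b[1]) / 2))

def anomalous_signals(failing_phigh, boxes, noise_threshold):
    """Signals outside their golden box, reported only over the threshold."""
    outliers = []
    for sig, value in failing_phigh.items():
        lo, hi = nearest_box(value, boxes)
        if value < lo or value > hi:
            outliers.append(sig)
    # Below the noise threshold, nothing in this time slice is flagged.
    return outliers if len(outliers) > noise_threshold else []

# Hypothetical golden clusters and one failing test's %high values.
golden_clusters = [[0.85, 0.90, 0.88], [0.05, 0.10, 0.08]]
boxes = cluster_boxes(golden_clusters)
failing = {"req_valid": 0.87, "gnt_stall": 0.50, "fifo_empty": 0.06}
```

Here `gnt_stall` at 50% high falls outside both golden boxes, so with a noise threshold of zero it would be flagged as a root-cause candidate.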
It’s a cool idea, but on the ten OpenSPARC testcases benchmarked, 30% of the tests do not report the correct time slice or signals, which is way too high to be of any practical use. I would love to see what would happen if a modern LLM or DNN were used instead of simple k-means clustering.
This is an “early” paper from 2013 using machine learning for post-silicon bug detection. For its time this must have been advanced work; Google Scholar lists it with 62 citations.
The idea is straightforward: run a test many times on a post-silicon design and record the results. When intermittent bugs occur, different executions of the same test yield different results, some passing and some failing. Intermittent failures, often due to on-chip asynchronous events and electrical effects, are among the most difficult to diagnose. The authors briefly consider using supervised learning, in particular one-class learning (only positive training data is available, since bugs are rare), but discard it as “not a good match for the application of bug finding”. Instead, they apply k-means clustering: similar results are grouped into k clusters of “close” results, minimizing the sum-of-squares distance within clusters. The paper reveals numerous technical details necessary to reproduce the results. Results are recorded as the “fraction of time the signal’s value was one during the time step”. The number of signals in a design, on the order of 10,000, is the dimensionality of the k-means clustering, which is NP-hard with respect to the number of dimensions, so the number of signals is capped at 500 using principal component analysis. The number of clusters can be neither too small (underfitting) nor too large (overfitting), and a proper anomaly detection threshold needs to be picked, expressed as the percentage of the total failing examples under consideration. Time localization of a bug is achieved by two-step anomaly detection: first identifying which time step presents a sufficient number of anomalies to reveal the occurrence of a bug, and then in a second round identifying the responsible bug signals.
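The two-step time localization described above can be sketched as follows; this is my hedged reading, not the authors' implementation, and the per-step anomaly counts are hypothetical.

```python
def localize_bug(outliers_per_step, step_threshold):
    """Step one: find the earliest time step whose anomaly count is high
    enough to reveal a bug. Step two: report that step's signals."""
    for step in sorted(outliers_per_step):
        signals = outliers_per_step[step]
        if len(signals) >= step_threshold:
            return step, signals
    # No time step had enough anomalies to indicate a bug.
    return None, []

# Hypothetical map of time step -> anomalous signal names from detection.
per_step = {0: ["x"], 3: ["a", "b", "c"], 5: ["d", "e", "f", "g"]}
step, signals = localize_bug(per_step, step_threshold=3)
```

Only the step with enough simultaneous anomalies is reported, which filters out isolated noise like the single outlier at step 0.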
Experiments on an OpenSPARC T2 design of about 500M transistors ran 10 workloads, with test lengths ranging between 60,000 and 1.2 million cycles, 100 times each as training. The authors then injected 10 errors and ran 1,000 buggy tests. On average 347 signals were detected for a bug (ranging from none to 1,000), and detection latency was ~350 cycles from bug injection to bug detection. The number of clusters and the detection threshold strongly influence the results, as does the quantity of training data. False positives and false negatives added up to 30-40 (in 1,000 buggy tests).
Even though the authors observe that “Overall, among the 41,743 signals in the OpenSPARC T2 top-level, the anomaly detection algorithm identified 347, averaged over the bugs. This represents 0.8% of the total signals. Thus, our approach is able to reduce the pool of signals by 99.2%”, in practice this may not be of great help to an experienced designer. Ten years have passed; it would be interesting to repeat this work using today’s machine learning capabilities, for example LLMs for anomaly detection.