You probably know the value proposition for using AI and ML (machine learning) in simulation regressions. There are lots of knobs you can tweak on a simulator, all there to help you squeeze more seconds or minutes out of a run, if you know how to use those options. But often it’s easier to talk to your friendly AE, get a reasonable default setup and stick with that. Consider that a sort of one-step learning.
However, what works well in one case may not be optimal in others. Learning must evolve as designs and test cases change. You can’t reasonably call the AE in for every run, and you shouldn’t have to. ML can automate the learning. Which makes sense, but what I had not realized is that one of the big impact areas for this technology is sanity regressions. Vishwanath (Vish) Gunge of Microsoft elaborated at Synopsys Verification Day 2021.
Why short regressions are such a good fit
Sanity tests are those tests you run to make sure you (or someone else) didn’t do something stupid. Like accidentally checking in code that you hadn’t finished fixing. Or leaving a high-verbosity debug switch turned on. When you want to integrate all the code in a big subsystem of the whole SoC, probabilities of a basic mistake add up quickly. We design sanity tests to smoke these problems out quickly. Because the last thing you want is to launch overnight regressions, then come back in the morning to garbage results. Sanity tests are designed to run quickly, maybe a few minutes, at most say 30 minutes, in parallel across many machines.
Seems like that wouldn’t be where you would find a big win in ML optimization. But you’d be wrong. It’s not the test run-time that matters, it’s the frequency of those tests. Vish said that in their environment, sanity regressions consume huge compute resources, running many times per day. Which I read as them using those regressions in the best possible way: flushing out basic mistakes at a per-designer level, a per-subsystem level and a full integration level. When a mistake is found, a sanity test (or tests) must be re-run. That’s a lot of checking before time is invested in expensive full regressions. Which is why ML can have an important impact.
Synopsys VCS® offers a dynamic performance optimization (DPO) option based on both proprietary and ML methods. I don’t know the internal details, but it is interesting that they use other methods in addition to ML. ML is the hot topic these days but it’s not always the most efficient way to get to a good result. Rule-based systems can be more semantically aware and converge quicker to an approximate solution, from which ML can then further optimize. At least that’s my guess.
That said, this is AI/ML so there is a “training” phase and an “application” phase. All packaged for ease of use, no AI skills required by the end user.
Dynamic performance optimization in action
Vish presented analysis comparing the non-AI (base-level) run-time with the run-times of the learning phase and the application phase on the same set of sanity runs. For DPO they used all optimization apps available as a starting point, for example FGP (fine-grained parallelism) with multiple cores. Naturally learning phase runs were slower than the base level, perhaps by ~30%. However, application runs were on average 25% faster, allowing them to do ~30% more of these regressions per day.
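The step from "25% faster" to "~30% more regressions" is simple arithmetic, sketched below. This is my own back-of-envelope check, not something from the talk: it assumes "25% faster" means each run takes 25% less wall-clock time and that the compute budget stays fixed. The function name `throughput_gain` is purely illustrative.

```python
def throughput_gain(runtime_reduction: float) -> float:
    """Fractional increase in runs per day on a fixed compute budget.

    Assumption: each run takes (1 - runtime_reduction) of the base run-time,
    so the number of runs that fit in a day scales by 1 / (1 - runtime_reduction).
    """
    if not 0.0 <= runtime_reduction < 1.0:
        raise ValueError("runtime_reduction must be in [0, 1)")
    return 1.0 / (1.0 - runtime_reduction) - 1.0


# 25% shorter runs: 1 / 0.75 - 1, i.e. about one third more runs per day.
gain = throughput_gain(0.25)
print(f"{gain:.1%}")
```

Under that reading the gain comes out at about a third, which is in line with the ~30% quoted. If "25% faster" instead meant 1.25x simulation speed, the gain would be 25%, so the quoted figure is plausible either way.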
Vish stressed that some thought is required to get maximum benefit in these flows since learning takes more time than base runs. He suggested running learning once every few days as the design is evolving, to keep optimizations reasonably current as design and tests change. Learning can run less frequently as the project is nearing signoff since optimum settings shouldn’t be expected to change as often.
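To see why the learning cadence matters, here is a rough amortization sketch. It is my own illustration, not Vish's analysis, and it assumes one learning run per cycle at roughly 1.30x the base run-time, with every other run in the cycle being an application run at 0.75x. The function name and the cycle sizes are hypothetical.

```python
def average_relative_cost(runs_per_cycle: int,
                          learn_overhead: float = 0.30,
                          app_reduction: float = 0.25) -> float:
    """Average run-time per regression, relative to a base (non-DPO) run.

    Assumption: each cycle contains exactly one learning run costing
    (1 + learn_overhead) and (runs_per_cycle - 1) application runs
    costing (1 - app_reduction) each.
    """
    if runs_per_cycle < 1:
        raise ValueError("runs_per_cycle must be at least 1")
    learning_cost = 1.0 + learn_overhead
    application_cost = 1.0 - app_reduction
    total = learning_cost + application_cost * (runs_per_cycle - 1)
    return total / runs_per_cycle


# Values below 1.0 mean the cycle as a whole beats running without DPO.
for runs in (1, 2, 3, 30):
    print(runs, round(average_relative_cost(runs), 3))
```

With these assumed numbers the learning run pays for itself once a cycle holds three or more regressions. At many sanity regressions per day and learning once every few days, the average cost settles close to the 0.75x application figure, which is why relearning only occasionally is the sensible policy.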
A very interesting and practical review. You can learn more from the recorded session. Vish’s talk is one of the early sessions on Day 2.