I have attended several past Synopsys verification events, which I remember as engineering-conference-room affairs: all-engineer pitches and debates, effective but aiming for content rather than polish. This year’s event was different. First, it was virtual, like most events these days, which certainly made the whole thing feel more prime-time ready. Also, each day of the two-day event started with a keynote, further underlining the big-conference feel. Finally, many of the pitches, mostly from big-name customers, looked very topical – verification in AI, continuous integration, cloud architectures. Exciting stuff!
IBM on AI hardware, implications for verification
Kailash Gopalakrishnan spoke on this topic. Kailash is an IBM Fellow and Senior Manager in the accelerator architectures and ML group at the T.J. Watson Research Center. He started by noting the rapid growth in AI model size over the past 3-5 years, more than four orders of magnitude. One direction he is pursuing to attack this is approximate computing: reducing word sizes for both integer and floating-point arithmetic, in training and in inference. This helps with both performance and power, critical for embedded applications such as in-line fraud prevention.
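The core trade-off behind reduced word sizes can be seen in a few lines of NumPy. This is my own toy illustration, not IBM’s method: cast weights and activations from 32-bit to 16-bit floats, halving memory and bandwidth in exchange for a small quantization error.

```python
import numpy as np

# Toy illustration (not IBM's implementation) of reduced-precision
# inference: cast "weights" and "activations" from float32 to float16.
rng = np.random.default_rng(0)
w32 = rng.standard_normal(1024).astype(np.float32)  # pretend weights
x32 = rng.standard_normal(1024).astype(np.float32)  # pretend activations

w16, x16 = w32.astype(np.float16), x32.astype(np.float16)

# Storage (and hence bandwidth) drops by exactly half...
mem_ratio = w16.nbytes / w32.nbytes  # 0.5
# ...while the worst per-element quantization error stays tiny.
quant_err = float(np.max(np.abs(w32 - w16.astype(np.float32))))
print(mem_ratio, quant_err)
```

In a real accelerator the win comes from narrower datapaths and memories, not from casting in software, but the accuracy/footprint trade is the same one the verification team then has to cover across every supported word size.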
For such AI accelerators, a wider range of word sizes increases complexity for formal methods. He also sees rising formal complexity in the use of software-managed scratchpads with complex arbiters. Advanced accelerators have more high-bandwidth asynchronous interfaces, driving yet more increases in verification runtimes and coverage complexity. Such designs commonly build on rapidly evolving deep-learning primitives and use-cases, so there are many more moving parts than we might normally expect when building on more stable IP and workloads for regular SoCs.
Big AI designs for datacenters are following similar paths to servers: massively arrayed cores on a chip, with PCIe support for DMA, and coherent die-to-die interfaces ganging together many die (or chips) for training. These giants must support virtualization, potentially running multiple training tasks in a single socket. All of this needs verification: complex software stacks (TensorFlow or PyTorch on down to the hardware) running on a virtual platform, together with emulation or FPGA prototyping for the hardware.
In next-generation chips, modeling and verification will need to encompass AI explainability and reasoning, and also secure execution. Analog AI will become more common. Unlike mixed-signal verification today (e.g. around IO cores), this analog will be sprinkled throughout the accelerator, which may raise expectations for AMS verification fidelity and performance. Finally, also for performance and power, 3D stacking will likely drive the need for more help in partitioning between stacked die. Not a new need, but likely to become even more important.
HPE on growing design team methodologies
David Lacey is Chief Verification Technologist in HP Enterprise Labs and was making a plea for more focus on methodologies, in part referring to opportunities for EDA vendors to provide more support, but much more for verification teams to graduate up the verification maturity curve. Here I imagine vigorous pushback – “our process is very mature, it’s the EDA vendors that need to improve!” David isn’t an EDA vendor, so his position should carry some weight. I’m guessing he sees a broad cross-section, from very sophisticated methodologies to quite a few that are less so, especially, I would think, in FPGA design, and even in ASIC teams with backgrounds in smaller devices.
David walked through five levels of maturity, starting from directed testing only. I won’t detail these here, but I will call out a few points I thought were interesting. At level 3, where you’re doing constrained-random testing, he mentioned really ramping up metrics: coverage certainly, but also compute-farm metrics to find who may be hogging an unusual level of resource, and performance metrics, especially in regressions. Generally, the aim is to look for big-picture problems across the project as well as trends by block (coverage not converging, for example).
He stresses automation: taking more advantage of tool features, and adding in-house scripting to aggregate data after nightly runs so you can quickly see what needs extra attention, eventually moving to continuous integration methodologies using Jenkins or similar tools. Mature teams no longer practice “everybody stop, now we’re going to integrate all check-ins and regress”. He also stressed working with your EDA vendor to implement upgrades that simplify these and other tasks.
Finally, the ultimate stage in maturity: using emulation to shift left and enable SW/HW co-development for system design, and taking advantage of the ML options we now see in some verification flows. These don’t require much ML understanding on your part but can offer big advances in getting to higher coverage quicker, simplifying static testing, accelerating root-cause analysis on bugs and reducing regression run-times. Consider also the ROI of working with your current compute farm versus upgrading servers, exploiting the cloud or a hybrid approach. From one generation to the next, server performance advances by 50%; per unit of throughput, a server upgrade is much cheaper than adding licenses. Moving to the cloud has flexibility advantages, but you need to actively manage cost. And EDA vendors should add as-a-service licensing models to make EDA in the cloud a more attractive option.
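The throughput ROI argument is simple enough to put in a back-of-the-envelope calculation. Only the 50% per-generation speedup comes from the talk; the job counts and dollar figures below are invented purely for illustration, so only the structure of the comparison matters.

```python
# Back-of-the-envelope ROI: next-gen server upgrade vs. one more license.
# The 50% speedup is from the talk; all other numbers are hypothetical.
old_jobs_per_day = 100     # regression jobs one current server clears daily
speedup = 1.5              # next-gen server: +50% performance
server_cost = 10_000       # assumed cost of one new server
license_cost = 30_000      # assumed cost of one extra simulator license

# Extra daily throughput from upgrading one server:
extra_jobs_upgrade = old_jobs_per_day * (speedup - 1)    # 50 jobs/day
# Extra daily throughput from one more license on existing hardware:
extra_jobs_license = old_jobs_per_day                    # 100 jobs/day

cost_per_job_upgrade = server_cost / extra_jobs_upgrade    # 200 per job/day
cost_per_job_license = license_cost / extra_jobs_license   # 300 per job/day
print(cost_per_job_upgrade < cost_per_job_license)
```

With these assumed prices the upgrade wins per unit of throughput, consistent with the talk’s claim; plugging in your own farm, license and cloud numbers is the actual exercise being recommended.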
Lots of good material. The whole session was recorded; I believe you can watch any of the talks through the end of the year. I’ll be posting more blogs over the next three months on other sessions in this valuable (and virtual) conference.