Key Takeaways
- The paper reviewed here covers post-silicon validation of the IBM POWER9 processor, a topic of growing relevance as multi-core processor adoption among hyperscalers accelerates.
- IBM has enhanced Threadmill, its processor-verification exerciser, since POWER7, including weighted template allocation and debug strategies for writes dropped in multi-core races.
- Compared with POWER8, the POWER9 validation methodology found bugs significantly faster, helped by new techniques such as AI-based template prioritization and hardware irritators that induce bugs.
What’s new in debugging multi-/many-core systems? Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and lecturer at Stanford, EE292A) and I continue our series on research ideas. As always, feedback welcome.
The Innovation
This month’s pick is Post-Silicon Validation of the IBM POWER9 Processor, published at the 2020 DATE conference. The authors are from IBM, and the paper has one citation.
This topic continues to attract interest given the accelerating adoption of these platforms among hyperscalers, though for some reason it has barely made a ripple in our usual research-paper sources. An exception is the IBM Threadmill paper we covered in 2022 and several follow-on papers from the same group. Here we review the latest of these, which describes how IBM applied the approach to its POWER9 processor.
The basic approach carries over from the earlier paper: post-silicon testing with a bare-metal exerciser that automatically re-randomizes between cycles. Several important refinements have been added. One I find interesting is irritators, used to bias testing toward multi-thread (and possibly multi-core) conflicts.
Paul’s view
We’re zooming back in again this month on randomized instruction generation for processor verification. A few years ago we blogged on a tool called Threadmill used by IBM for verification of their POWER7 processor. This month we’re checking out a short paper on their experiences verifying the POWER9 processor.
More and more companies are developing custom processors based on either the Arm64 or RISC-V ISA. Arm-based computing is scaling out in datacenters and laptops, and RISC-V processors are becoming widespread in a variety of embedded applications. Verifying processors, especially advanced ones with multiple cores and multiple out-of-order execution pipelines, is really hard and something of a dark art.
Threadmill is a low-level “exerciser” program that runs directly on the bare-metal processor (i.e., without any OS). It is configured with templates: snippets of machine code parameterized so they can be randomized, for example by randomizing instructions or addresses. The exerciser can run pre-silicon in simulation, in emulation, or on FPGA, and post-silicon in the lab.
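To make the idea of a template concrete, here is a minimal Python sketch of template instantiation. The template text, the register conventions (r3 to r12 as scratch, r14 as the data-region base) and the `instantiate` helper are hypothetical illustrations, not Threadmill's actual template format. The point is only that every placeholder is filled with a fresh random value on each instantiation, and that a fixed seed makes any instantiation reproducible.

```python
import random

# Hypothetical template: load / ALU op / store written in POWER-style
# assembly. Each {placeholder} is a field the exerciser randomizes.
TEMPLATE = (
    "lwz r{rt}, {disp}(r{base})\n"
    "{alu} r{rt}, r{rt}, r{rb}\n"
    "stw r{rt}, {disp}(r{base})"
)

ALU_OPS = ("add", "subf", "and", "or", "xor")
SCRATCH_REGS = range(3, 13)  # assumption: r3-r12 are free for templates
BASE_REG = 14                # assumption: r14 holds the data-region base


def instantiate(template: str, rng: random.Random, region_bytes: int = 4096) -> str:
    """Return one concrete instruction sequence with every field randomized."""
    fields = {
        "rt": rng.choice(SCRATCH_REGS),
        "rb": rng.choice(SCRATCH_REGS),
        "alu": rng.choice(ALU_OPS),
        "base": BASE_REG,
        # Word-aligned displacement that stays inside the data region.
        "disp": 4 * rng.randrange(region_bytes // 4),
    }
    return template.format(**fields)


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed: the same run can be regenerated
    for _ in range(2):
        print(instantiate(TEMPLATE, rng), end="\n\n")
```

Each call draws new values, so repeated instantiations of one template produce different instruction streams, while re-seeding with the same value regenerates an identical stream when a failure needs to be debugged.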
This paper shares several interesting new nuggets on how IBM has enhanced Threadmill since POWER7. First, they weighted the runtime allocation of templates during emulation, running the templates that find more bugs for 10-100x more clock cycles. Second, they deployed some clever information-encoding tricks to assist in debug. For example, for a bug in which a write was dropped when multiple cores incremented the same memory address, they have each core increment that address by a different amount. The difference between the actual and expected values at that address then tells them which core’s write was dropped due to the race. Third, they enhanced Threadmill with more tricks to bias randomization toward hitting bugs. The original Threadmill paper from POWER7 describes the trick of using the same random seed across multiple cores for memory addresses, which increases the frequency of load/store races. In this POWER9 paper they also biased addresses to align with memory page boundaries, increasing the frequency of cross-page accesses. Lastly, they used AI to further prioritize templates so as to hit coverage faster.
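Two of these tricks are simple enough to sketch in a few lines of Python. The first pair of functions shows one way to pick per-core increment amounts so that the shortfall in a shared counter identifies which cores lost writes. As an assumption beyond what the paper states, core i adds (n + 1)^i, so the shortfall written in base n + 1 gives each core's dropped-write count directly. The second function combines the shared-seed idea with page-boundary biasing, assuming 4 KiB pages. All names here are hypothetical.

```python
import random

PAGE_SIZE = 4096  # assumption: 4 KiB pages


# Trick 1: encode the core ID in the increment amount.
# Assumption: each of `num_cores` cores increments one shared counter `n`
# times, and core i adds (n + 1) ** i. The shortfall against the expected
# total, written in base n + 1, then has one digit per core equal to the
# number of that core's writes that were dropped.
def increment_amounts(num_cores: int, n: int) -> list[int]:
    return [(n + 1) ** core for core in range(num_cores)]


def decode_dropped_writes(actual: int, num_cores: int, n: int) -> list[int]:
    """Return the number of dropped writes per core, given the final value."""
    expected = n * sum(increment_amounts(num_cores, n))
    deficit = expected - actual
    if deficit < 0:
        raise ValueError("counter exceeds expected total: not a dropped write")
    drops = []
    for _ in range(num_cores):
        deficit, digit = divmod(deficit, n + 1)
        drops.append(digit)
    if deficit:
        raise ValueError("shortfall exceeds the total of all writes")
    return drops


# Trick 2: shared seed plus page-boundary biasing.
# Every core calls this with the same seed, so all cores target the same
# addresses, and each access_size-byte access straddles a page boundary.
def boundary_addresses(seed: int, count: int, num_pages: int,
                       access_size: int) -> list[int]:
    """Requires num_pages >= 2 and access_size >= 2."""
    rng = random.Random(seed)
    addrs = []
    for _ in range(count):
        boundary = PAGE_SIZE * rng.randrange(1, num_pages)
        # Start 1..access_size-1 bytes before the boundary so the access crosses it.
        addrs.append(boundary - rng.randrange(1, access_size))
    return addrs


if __name__ == "__main__":
    n, cores = 3, 4
    amounts = increment_amounts(cores, n)            # [1, 4, 16, 64]
    final = n * sum(amounts) - amounts[2]            # core 2 lost one write
    print(decode_dropped_writes(final, cores, n))    # prints [0, 0, 1, 0]
```

With four cores each doing three increments, the amounts are 1, 4, 16 and 64, so a final value 16 short of the expected 255 decodes to a single dropped write from core 2. Choosing powers of n + 1 rather than arbitrary distinct amounts keeps the decoding unambiguous even when several cores drop several writes each.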
All in, compared to POWER8, they found 30% more bugs in 80% of the time. Decent progress on a very tough problem!
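Read as a discovery rate (bugs per unit of validation time), and assuming both figures are relative to the same POWER8 baseline of B_8 bugs found in time T_8, the improvement works out to roughly 1.6x:

```latex
\frac{\text{rate}_9}{\text{rate}_8}
  = \frac{1.30\,B_8 / (0.80\,T_8)}{B_8 / T_8}
  = \frac{1.30}{0.80}
  = 1.625
```

That is, about 62.5% more bugs found per unit of time.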
Raúl’s view
State-of-the-art processors, such as the IBM POWER9 processor described in this paper, typically undergo multiple tape-outs. Pre-silicon verification cannot identify all bugs, particularly those related to hard-to-hit software timing issues, very long loops, or deep power states, which are not exposed by the Instruction Set Simulator (ISS). This challenge is exacerbated in multi-core, multi-threaded architectures.
The reviewed paper outlines the validation methodology IBM implemented for the POWER9 processor. In October 2022, we examined the methodology used for validating the POWER7 processor, many aspects of which remain applicable.
The approach involves using a bare-metal, self-contained exerciser called Threadmill, which generates sequences of instructions based on templates. These sequences are executed pre-silicon on the ISS and within a highly instrumented Exerciser on Accelerator (EoA) environment. Root-cause analysis is considered complete only when a bug is reproduced across simulation, EoA, and the post-silicon lab. The paper details numerous practical aspects of this process. For example, when the bug rate declines, hardware irritators are employed to induce new bugs, such as by artificially reducing cache sizes and queue depths. Templates with high RTL coverage that uncover numerous bugs are executed 10 to 100 times longer than usual on the accelerator.
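The run-time weighting and the irritator escalation described above can be sketched as a simple scheduling policy. This is a hypothetical illustration: the `TemplateStats` record, the coverage threshold, the 10x-per-bug rule capped at 100x, and the `cache_size`/`queue_depth` scaling knobs are all assumptions chosen to match the 10 to 100 times range quoted in the paper, not IBM's actual policy.

```python
from dataclasses import dataclass


@dataclass
class TemplateStats:
    """Hypothetical per-template results from earlier accelerator runs."""
    name: str
    bugs_found: int
    coverage: float  # fraction of targeted RTL coverage points hit (0.0 to 1.0)


def boost_factor(s: TemplateStats, min_coverage: float = 0.5) -> int:
    """Assumed rule: 10x per bug for high-coverage templates, capped at 100x."""
    if s.bugs_found == 0 or s.coverage < min_coverage:
        return 1
    return min(100, 10 * s.bugs_found)


def allocate_cycles(stats: list[TemplateStats], base_cycles: int) -> dict[str, int]:
    """Accelerator cycle budget per template."""
    return {s.name: base_cycles * boost_factor(s) for s in stats}


# Hypothetical hardware irritator settings, as fractions of the real
# cache size and queue depth; higher levels stress the design harder.
IRRITATOR_LEVELS = (
    {"cache_size": 1.0, "queue_depth": 1.0},
    {"cache_size": 0.5, "queue_depth": 0.5},
    {"cache_size": 0.25, "queue_depth": 0.25},
)


def should_escalate(bug_counts: list[int], window: int = 3) -> bool:
    """True if bugs found per run fell strictly over the last `window` runs."""
    recent = bug_counts[-window:]
    return len(recent) == window and all(
        later < earlier for earlier, later in zip(recent, recent[1:])
    )


def next_level(level: int, bug_counts: list[int]) -> int:
    """Move to the next irritator level when the bug rate is declining."""
    if should_escalate(bug_counts):
        return min(level + 1, len(IRRITATOR_LEVELS) - 1)
    return level


if __name__ == "__main__":
    stats = [TemplateStats("lsu_race", 25, 0.7), TemplateStats("fp_mix", 0, 0.9)]
    print(allocate_cycles(stats, base_cycles=1_000_000))
    print(IRRITATOR_LEVELS[next_level(0, [12, 7, 3])])
```

Under this rule a template that found 25 bugs at 70% coverage gets the full 100x budget, while one with low coverage stays at 1x regardless of its bug count; the irritator level only steps up once the per-run bug count has fallen for three runs in a row.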
IBM’s overall validation methodology continues to improve. Compared with POWER8, POWER9 validation increased bugs found in EoA from 1% to 6% and in post-silicon from 1% to 4%, and reduced the time required to root-cause 90% of the bugs from 31 days to 17.
There are open-source instruction generators for RISC-V available on GitHub. The RISC-V DV (Design Verification) framework, maintained by CHIPS Alliance, is an open-source tool for verifying RISC-V processor cores. FORCE-RISC-V, an instruction sequence generator for the RISC-V instruction set architecture from Futurewei, supports multi-core and multi-threaded instruction generation.
Overall, the paper provides valuable insights, especially for practitioners involved in processor validation.