Many multi-thread consistency problems emerge only in post-silicon testing. Maybe we should take advantage of that fact. Paul Cunningham (Senior VP/GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
This month’s pick is Threadmill: A Post-Silicon Exerciser for Multi-Threaded Processors. The authors presented the paper in the 2011 DAC proceedings, publishing in both ACM and IEEE Digital Libraries. At publication the authors were at IBM Research in Haifa, Israel.
The authors’ goal is to generate multi-threaded tests to run on first silicon, requiring only a bare metal interface. This exerciser is a self-contained program with all supporting data and library functions. A compile step pre-computes and optimizes as much as possible to minimize compute requirements in the exerciser. On-silicon, the exerciser generates multiple threads with shared generated addresses to maximize potential collisions. The exerciser can run indefinitely using new randomization choices for each set of threads.
Consistency mismatches when found are taken back to emulation for debug. These start with the same random seeds used in the exerciser, rolled back a few tests before a mismatch was detected.
Great paper on a neat tool from IBM research in Israel. A hot topic in our industry today is how to create “synthetic” post-silicon tests to stress hardware beyond what is possible pre-silicon, while also composed of many short, modular, and distinct semi-randomized auto-generated sequences. Hardware bugs found by running real software workloads post-silicon typically require billions of cycles to replicate, making them impractical to port back to debug friendly pre-silicon environments. But bugs found using synthetic tests can be easily replicated by replaying only the relevant synthetic sequence that triggered the bug making them ideally suited to replication in pre-silicon environments.
Threadmill is a valuable contribution to the field of post-silicon synthetic testing of multi-threaded CPUs. It takes a high level pseudo-randomized test definition as input and compiles it into a self-contained executable to run on silicon. This generates randomized instruction sequences conforming to the test definition. Sequence generation is “online” in the silicon without need to stream any data from an offline tool, allowing generation of massive amounts of synthetic content at full silicon performance.
Lacking a golden reference for tests, Threadmill runs each test several times and checks that the final memory and register values at the end of the test are the same. This approach catches only non-determinism related bugs, e.g. those related to concurrency across multiple CPU threads – hence the tool’s name, ThreadMill. The authors rightly point out that such concurrency related bugs are some of the most common bugs that escape pre-silicon verification of modern CPUs. Threadmill can offer high value even with this somewhat limited scope.
The bulk of the paper is devoted to several techniques the authors deploy to make Threadmill’s randomization more potent. These techniques are clever and make for a wonderful read. For example, one trick is to use different random number seeds on each CPU thread for the random instruction generator, but the same random number seed across all CPU threads for random memory address generation. This trick has the effect of creating different randomized programs running concurrently on each CPU, but with each of these programs having a high probability of catching bugs related to memory consistency in the presence of read/write race conditions. Nice!
IBM’s functional verification methodology for the POWER7 processor (2011) consisted of a unified methodology including pre- and post-silicon verification. Differences between pre- and post-silicon include speed, observability, and the need for a lightweight exerciser they can load into the bare-metal chip. Both Threadmill and the pre-silicon platform (Genesys-Pro) use templates like “Store R5 into addresses 0x100+n*0x10 for addresses <0x200” to generate testcases. The key to the unified methodology is using the same verification plan, the same languages, templates, etc.
The authors describe Threadmill at a high level, with many interesting details. One example is need to run coverage analysis on an accelerator, not on the chip, because the limited observability of the silicon does not allow to measure it on the silicon. The exerciser executes the same test case multiple times and compares results; multi-pass comparison is limited but has proven effective in exposing control-path bugs and bugs in the interface of the control and data-paths. Branches are generated statically, using illegal instruction interrupts before a branch to force taking one particular branch. Data for floating point instructions is generated as a combination of tables with interesting data for a given instruction and random values. Generation of concurrent tests (multiple threads) relies on shared random number generators, e.g., to select random collision memory locations. They debug failing tests by restarting the exerciser a few bugs before the failure on the acceleration platform.
Coverage results indicate at least one high impact bug exposed on an accelerator before tape-out. Also “Results of this experience confirm our beliefs about the benefits of the increased synergy between the pre- and post-silicon domains and of using a directable generator in post-silicon validation”.
The papers are easy to follow and fun to read, lots of common sense. The shared coverage and collision experiment results are hard to judge. One must rely on the author’s comments on the contribution of post-silicon validation in their methodology. Post-silicon validation is a critical component of processor design; ultimately intertwined with design. Every group designing a complex processor will use its own methodology. It continues to be a fertile publication area. In 2022 alone Google scholar lists over 70 papers on this subject .
I’m not sure about the relevance today of the post-silicon floating point tests today. The memory consistency tests make a lot of sense to me.
Share this post via: