Following a similar topic we covered early last year, here we look at updated research on accelerating RTL simulation through domain-specific hardware. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and lecturer at Stanford, EE292A) and I continue our series on research ideas. As always, feedback welcome.
The Innovation
This month’s pick is Accelerating RTL Simulation with Hardware-Software Co-Design. This was published in the 2023 IEEE/ACM International Symposium on Microarchitecture and has 2 citations. The authors are from MIT CSAIL (CS/AI Lab).
This work is from the same group leader as the earlier paper. Their new approach, ASH, adds dataflow acceleration, which was not available in the earlier work and which, together with speculation, provides the large net performance gain in this research.
Paul’s view
Important blog to end our year. This paper is a heavy read but it’s on a billion dollar topic for verification EDA: how to get a good speed-up from parallelizing logic simulation. Paper is out of MIT, from the same team that published the Chronos paper we blogged on back in March 2023 (see here). This team are researching hardware accelerators that operate by scheduling timestamped tasks across an array of processing elements (PEs). The event queue semantics of RTL logic simulation map well to this architecture. Their accelerators also include the ability to do speculative execution of tasks to further enhance parallelism.
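The event-queue semantics Paul mentions can be sketched in miniature: a priority queue of timestamped events popped in time order, where handling one event may enqueue new events at later timestamps. This is an illustrative toy only, not the Chronos or SASH implementation; all names and the handler interface are invented for the example.

```python
import heapq

# Toy sketch of event-queue simulation semantics: timestamped events are
# popped in time order, and processing one event may enqueue new events
# at later timestamps (modeling signal fanout and propagation delay).
def run_event_queue(initial_events, handlers, max_time=100):
    """Pop (timestamp, signal, value) events in order; handlers may emit more."""
    queue = list(initial_events)
    heapq.heapify(queue)
    trace = []
    while queue:
        t, signal, value = heapq.heappop(queue)
        if t > max_time:
            break
        trace.append((t, signal, value))
        # A handler models the fanout of a signal change: it returns the
        # downstream events (possibly at later timestamps) it triggers.
        for new_event in handlers.get(signal, lambda t, v: [])(t, value):
            heapq.heappush(queue, new_event)
    return trace

# Toy design: a change on 'clk' updates 'q' one time unit later.
handlers = {"clk": lambda t, v: [(t + 1, "q", v)]}
trace = run_event_queue([(0, "clk", 1), (5, "clk", 0)], handlers)
# trace: [(0, 'clk', 1), (1, 'q', 1), (5, 'clk', 0), (6, 'q', 0)]
```

In hardware accelerators like Chronos, the analogue of this queue is distributed across processing elements, with speculation used to execute tasks before it is certain no earlier-timestamped task will conflict with them.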
As we blogged in 2023, while Chronos showed some impressive speed-ups, the only result shared was for the gate-level simulation of a single 32-bit adder. Fast forward to today’s blog and we have some serious results on 4 credible RTL testcases including an open-source GPU and an open-source RISC-V core. Chronos doesn’t cut it on these more credible testcases – actually it appears to slow down the simulations. However, this month’s paper describes some major improvements on Chronos that look very exciting on these more credible benchmarks – in the range of 50x speed-up over a single-core simulation. The new architecture is called SASH, a Speculative Accelerator for Simulated Hardware.
In Chronos, each task can input and output only one wire/reg value change. This limits it to a low level of abstraction (i.e. gate-level), and also conceptually means that any reconvergence in logic is “unfolded” into cones causing significant unnecessary replication of tasks. In SASH each task can input and output multiple reg/wire changes so tasks can be more like RTL always blocks. Input/output events are passed as “arguments” through an on chip network and queued at PEs until all arguments for a task are ready. Speculative task execution is also elegantly implemented with some efficient HW. The authors modify Verilator (an open-source RTL simulator) to compile to SASH. Overall, very impressive work.
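The multi-argument firing rule described above can be illustrated with a small sketch: a task (akin to an RTL always block) is held at a PE until all of its input arguments have arrived, then executes once and emits its outputs. The class and names below are invented for illustration and do not reflect the actual SASH hardware or its Verilator-based compiler.

```python
# Hedged sketch of a dataflow firing rule: a task waits until ALL of its
# input arguments are present, then executes once and returns output events.
class DataflowTask:
    def __init__(self, name, inputs, fn):
        self.name = name
        self.inputs = set(inputs)   # argument names this task waits on
        self.pending = {}           # arguments received so far
        self.fn = fn                # body: dict of inputs -> dict of outputs

    def deliver(self, arg, value):
        """Queue one argument; fire only when every input is present."""
        self.pending[arg] = value
        if set(self.pending) == self.inputs:
            outputs = self.fn(dict(self.pending))
            self.pending.clear()
            return outputs          # output events to route to consumer tasks
        return None                 # still waiting on other arguments

# Toy example: a 1-bit adder task that fires once both operands arrive.
adder = DataflowTask("adder", ["a", "b"],
                     lambda args: {"sum": (args["a"] + args["b"]) & 1,
                                   "carry": (args["a"] + args["b"]) >> 1})
first = adder.deliver("a", 1)   # one argument present: task stays queued
out = adder.deliver("b", 1)     # both present: task fires and emits outputs
# first is None; out == {'sum': 0, 'carry': 1}
```

In the paper's architecture the analogous queuing happens in hardware, with arguments carried over an on-chip network to the PE holding the task.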
One important thing to note: the authors do not actually implement SASH in an ASIC or on an FPGA. Instead, a virtual model of SASH was built using Intel’s Pin utility (a low-level x86 virtual machine utility with just-in-time code instrumentation capabilities). I look forward to seeing a future paper that puts it in silicon!
Raúl’s view
In March of 2023 we reviewed Chronos (published in March 2020), based on the Spatially Located Ordered Tasks (SLOT) execution model. This model is particularly efficient for hardware accelerators that leverage parallelism and speculation, as well as for applications that dynamically generate tasks at runtime. Chronos was implemented on FPGAs and, on a single processing element (PE), outperformed a comparable CPU baseline by 2.45x. It demonstrated the potential for greater scalability, achieving a 15.3x speedup on 32 PEs.
Fast forward roughly three and a half years, and the same research group published the paper we review here, on ASH (Accelerator of Simulated Hardware), a co-designed architecture and compiler specifically for RTL simulation. ASH was benchmarked on 256 cores, achieving a 32.4x acceleration over an AMD Zen2 based system, and a 21.3x speedup compared to a simulated, special-purpose multicore system.
The paper is not easy to read. The initial discussion on why RTL simulation is difficult and needs fine-grained parallelism to handle both dataflow parallelism and selective execution / low activity factors is still easy to follow. The ASH architecture comes in two flavors: DASH (Dataflow ASH) provides novel hardware mechanisms for dataflow execution of small tasks; and SASH (Selective event-driven ASH) extends DASH with selective execution, running only tasks whose inputs change during a given cycle. The latter is obviously the more effective one.
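The selective-execution idea is easy to convey with a toy sketch: per simulated cycle, evaluate only the blocks whose inputs changed, rather than re-evaluating the whole design. This is a simplified software illustration of the concept, not the paper's hardware mechanism; the block representation and function names are invented.

```python
# Toy sketch of selective (activity-driven) execution: only blocks whose
# inputs changed this cycle are evaluated; their output changes seed the
# set of signals to propagate on the next pass.
def simulate_cycle(state, changed, blocks):
    """blocks: list of (input_names, fn) where fn(state) -> {output: value}."""
    evaluated = 0
    next_changed = set()
    for inputs, fn in blocks:
        if changed & set(inputs):          # selective: skip inactive blocks
            evaluated += 1
            for out, val in fn(state).items():
                if state.get(out) != val:  # record only real value changes
                    state[out] = val
                    next_changed.add(out)
    return next_changed, evaluated

# Toy design: b depends on a; c depends on b; d depends on an unchanged x.
blocks = [
    (["a"], lambda s: {"b": s["a"] + 1}),
    (["b"], lambda s: {"c": s["b"] * 2}),
    (["x"], lambda s: {"d": s["x"]}),
]
state = {"a": 1, "b": 0, "c": 0, "x": 9, "d": 9}
changed, n = simulate_cycle(state, {"a"}, blocks)
# Only the block reading 'a' ran (n == 1); 'b' changed, so a following
# pass would run the block reading 'b', while the 'x' block stays idle.
```

With typical RTL activity factors being low (most signals do not toggle each cycle), skipping inactive blocks is where much of SASH's advantage over pure dataflow execution comes from.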
The compiler implementation for these architectures adds 12K lines of code to Verilator, while maintaining Verilator’s fast compilation times (Verilator is a full-featured open-source simulator for Verilog/SystemVerilog). The HW implementation is evaluated “using a simulator based on Swarm’s simulator [2, 27, 76], which is execution-driven using Pin [36, 43]”. The area of a HW implementation of SASH in a 7nm process is estimated to be a modest 115 mm². These descriptions, however, are not self-contained and require additional reading for a full understanding. The paper includes a detailed architectural analysis, covering aspects such as prefetching instructions, prioritized dataflow, queue utilization, etc. It also compares ASH to related work, including of course Chronos and other dataflow / speculative execution architectures, as well as HW emulators and GPU acceleration.
The paper addresses specifically accelerating RTL simulation. It tackles the challenges of RTL simulation through a combination of hardware and software, using dataflow techniques and selective execution. Given the sizable market for emulators in the EDA industry, there is potential for these ideas to be commercially adopted, which could significantly accelerate RTL simulation.