This is an interesting idea: using hardware-supported speculative parallelism to accelerate simulation, with the twist that it requires custom hardware. Paul Cunningham (Senior VP/GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
The Innovation
This month’s pick is Chronos: Efficient Speculative Parallelism for Accelerators. The authors, from MIT, presented the paper at the 2020 Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
Exploiting parallelism using multicore processors is one option for applications where the parallelism is self-evident. Other algorithms are not so easily partitioned but might still benefit from speculative execution to exploit intrinsic parallelism. Usually, speculative execution depends on cache coherence, which carries high overhead, especially for simulation. This method bypasses the need for coherence by physically localizing task execution to compute tiles according to the target read-write object, ensuring that conflicts can be detected locally, without global coherence management. Tasks execute speculatively in parallel; any task found to conflict is rolled back, together with its child tasks, and then re-executed, without stalling other threads.
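To make the execution model concrete, here is a minimal Python sketch of the serial reference semantics the hardware must preserve: tasks are popped in timestamp order, each task operates on exactly one object selected by its object id, and each task may create child tasks. The names and task layout here are my own illustration, not from the Chronos codebase.

```python
import heapq
from itertools import count

def run(initial_tasks, objects):
    # Serial reference semantics a Chronos-style accelerator must preserve:
    # pop tasks in timestamp order; each task reads/writes only the single
    # object its object id selects, and may return child tasks.
    tie = count()  # tiebreaker so the heap never compares task functions
    heap = [(ts, next(tie), oid, fn, args)
            for ts, oid, fn, args in initial_tasks]
    heapq.heapify(heap)
    while heap:
        ts, _, oid, fn, args = heapq.heappop(heap)
        for cts, coid, cfn, cargs in fn(ts, objects[oid], *args):
            heapq.heappush(heap, (cts, next(tie), coid, cfn, cargs))
```

Chronos executes these tasks speculatively and in parallel, but any schedule it produces must be indistinguishable from this serial loop.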
One other point of note: unlike most hardware acceleration techniques, this method supports delay-based simulation.
Paul’s view
Wow, what a wonderful high-octane paper from MIT! When asked about parallel computation, I immediately think of threads, mutexes, and memory coherency. This is of course how modern multi-core CPUs are designed. But it is not the only way to support parallelization in hardware.
This paper proposes an alternative architecture for parallelization called Chronos that is based on an ordered queue of tasks. At runtime, tasks are executed in timestamp order and each task can create new sub-tasks that are dynamically added to the queue. Execution begins by putting some initial tasks into the queue and ends when there are no more tasks in the queue.
Tasks in the queue are farmed out to multiple processing elements (PEs) in parallel – which means Chronos is speculatively executing future tasks before the current task has completed. If the current task invalidates any speculatively executed future tasks then the actions of those future tasks are “undone” and they are re-queued. Implementing this concept correctly in hardware is not easy, but to the outside user it’s beautiful: you just code your algorithm as if the task queue is being executed serially on a single PE. No need to code any mutexes or worry about deadlock.
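To illustrate what "coding as if serial" looks like, here is a hedged sketch of a single user-level task for gate-level simulation (the task signature and field names are hypothetical, not from the paper): the task evaluates one gate, and if its output changes, it spawns evaluation tasks for the fanout gates at a later timestamp.

```python
# Hypothetical user-level task for gate-level simulation: evaluate one
# gate and, on an output change, spawn child tasks for its fanout.
def eval_gate_task(ts, gate, new_value, pin):
    gate.inputs[pin] = new_value
    out = gate.logic(gate.inputs)        # e.g. AND/OR/XOR of the inputs
    if out == gate.output:
        return []                        # output unchanged: nothing to do
    gate.output = out
    # child tasks: re-evaluate each fanout gate after this gate's delay
    return [(ts + gate.delay, fanout_id, eval_gate_task, (out, fanout_pin))
            for fanout_id, fanout_pin in gate.fanout]
```

Written this way, the code reads like a single-threaded discrete-event simulator, which is exactly the point: there are no locks or threads anywhere in the user code.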
The authors implement Chronos in SystemVerilog and compile it to an FPGA. Much of the paper is devoted to explaining how they have implemented the task queue and any necessary rollback in hardware for maximum efficiency. Chronos is benchmarked on four algorithms well suited to a task-queue-based architecture. Each algorithm is implemented two ways: first using a dedicated algorithm-specific PE, and second using an off-the-shelf, open-source 32-bit embedded RISC-V CPU as the PE. Chronos performance is then compared to multi-threaded software implementations of the algorithms running on an Intel Xeon server with a similar price tag to the FPGA used for Chronos. The results are impressive: Chronos scales 3x to 15x better than the Xeon server. However, comparing Table 3 to Figure 14 makes me worry a bit that most of these gains came from the algorithm-specific PEs rather than from the Chronos architecture itself.
Given this is a verification blog, I naturally zoomed in on the gate-level simulation benchmark. The EDA industry has invested heavily in trying to parallelize logic simulation, and it has proven difficult to see big gains beyond a few specific use cases. This is mainly because the performance of most real-world simulations is dominated by load/store instructions that miss in the L3 cache and go out to DRAM. Only one testcase is benchmarked in this paper, and it is a tiny 32-bit carry-save adder. If you are reading this blog and would be interested in doing some more thorough benchmarking, please let me know – if Chronos can truly scale well on real-world simulations it would have huge commercial value!
Raúl’s view
The main contribution of this paper is the Spatially Located Ordered Tasks (SLOT) execution model, which is efficient for hardware accelerators that exploit parallelism and speculation, and for applications that generate tasks dynamically at runtime. Dynamic parallelism support is indispensable for simulation, and speculative synchronization is an appealing option, but its coherence overhead is too high.
SLOT avoids the need for coherence by restricting each task to operate on (write) a single object, and it supports ordered tasks to enable multi-object atomicity. SLOT applications consist of ordered, dynamically created tasks, each characterized by a timestamp and an object id. Timestamps specify ordering constraints; object ids specify data dependences, i.e., tasks are data-dependent if and only if they have the same object id (a task with only a read dependency can still be executed speculatively). Conflict detection becomes local, with no complex tracking structures, by mapping object ids to cores or tiles and sending each task to where its object id is mapped.
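A small Python sketch of why conflict detection becomes local (the names and data layout are my own illustration; in Chronos this tracking is done in per-tile hardware): because every task touching a given object id is routed to the same tile, that tile can detect a conflict by comparing timestamps against only its own in-flight speculative tasks.

```python
from collections import namedtuple

Task = namedtuple("Task", "timestamp obj_id")
NUM_TILES = 16

def home_tile(obj_id):
    # All tasks that write obj_id are routed here, so any two tasks that
    # could conflict always meet on the same tile - no global coherence
    # traffic is needed.
    return hash(obj_id) % NUM_TILES

def tasks_to_abort(arriving, speculative_on_object):
    # Tasks are data-dependent iff they share an object id. A speculative
    # task with a later timestamp than the arriving one executed too early
    # and must be rolled back (together with any child tasks it spawned).
    return [t for t in speculative_on_object
            if t.timestamp > arriving.timestamp]

spec = [Task(30, "a"), Task(10, "a")]
print(tasks_to_abort(Task(20, "a"), spec))  # [Task(timestamp=30, obj_id='a')]
```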
The Chronos system was implemented in the AWS FPGA framework as a system with 16 tiles, each with 4 application-specific processing elements (PEs), running at 125MHz. This system is compared with a baseline consisting of a 20-core/40-thread 2.4 GHz Intel Xeon E5-2676v3, chosen specifically because its price is comparable with that of the FPGA (approx. $2/hour). Running a single task on one PE, Chronos is 2.45x faster than the baseline. As the number of concurrent tasks increases, the Chronos implementation scales to a self-relative speedup of 44.9x on 8 tiles, corresponding to a 15.3x speedup over the CPU implementation. They also compared an implementation based on general-purpose RISC-V cores rather than application-specific PEs; the application-specific PEs were 5x faster.
I found the paper impressive because it covers everything from concept, to the definition of the SLOT execution model, to a hardware implementation, to a detailed comparison with a traditional Xeon CPU across 4 applications. The effort is substantial: Chronos is over 20,000 lines of SystemVerilog. The result is a mean speedup (over the 4 applications) of 5.4x over software-parallel versions, due to more parallelism and more use of speculative execution. The paper is also worth reading for its application to non-simulation tasks; it includes three examples.