GPUs have been proposed before to accelerate logic simulation but haven't quite met the need yet. This is a new attempt, modeled on FPGA emulator flows. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO, and lecturer at Stanford, EE292A) and I continue our series on research ideas. As always, feedback welcome.

The Innovation
This month’s pick is GEM: GPU-Accelerated Emulator-Inspired RTL Simulation. The authors are from Peking University, China, and NVIDIA. The paper was presented at DAC 2025 and has no citations so far.
There have been previous attempts to accelerate logic simulation using GPU hardware, which have apparently foundered on a poor match between the heterogeneous nature of logic circuit activity and the SIMT architecture of GPUs. This paper proposes a new approach, modeled on FPGA-based emulators/prototypers and supported by a very long instruction word architecture. It claims impressive speedup over CPU-based simulation.
Paul’s view
Very interesting paper this month from NVIDIA Research and Peking University. It takes a fresh look at accelerating logic simulation on GPUs, something Cadence has invested heavily in since acquiring Rocketick in 2016. With the explosion in GPU computing for AI, customer motivation to use GPUs to accelerate simulation is even higher, and we are doubling down on our efforts in this area.
An NVIDIA GPU is a massive single-instruction-multiple-thread (SIMT) machine. Harnessing its power requires mapping a circuit to a large number of threads that each execute the same underlying program with minimal inter-thread communication. The key to doing this is intelligent replication and intelligent partitioning of logic cones across threads. Replication reduces inter-thread communication: rather than computing the shared fan-in of multiple logic cones in one thread and passing the result to other threads, have the thread for each logic cone recompute that shared fan-in itself. Smart partitioning ensures that thread processors are well utilized: we don't want thread processors executing very deep logic cones to leave idle other thread processors that have already finished short paths.
In this paper, the authors synthesize a circuit to an AND-Inverter Graph (AIG). To mitigate the problem of a few deep logic cones bottlenecking parallelization, they introduce a "boomerang" partitioner. This partitioner aims to balance the fan-in width of each partition rather than the gate count of each partition. Each partition is then mapped to a bit-packed structure that can be batch loaded from memory and executed very efficiently on an NVIDIA GPU. This bit-packed structure uses a 32-bit integer AND instruction followed by a 32-bit XOR-with-mask instruction to perform 32 AND-INVERT operations in one shot, with all thread processors executing this same simple program.
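As a concrete illustration, the packed AND-then-XOR step can be sketched in a few lines of Python. This is our reconstruction, not the paper's kernel: the function name and the packing of one gate per bit position are assumptions; the real implementation runs the equivalent word operations on GPU thread processors.

```python
# Hypothetical sketch of 32 AND-INVERT gates evaluated in one shot.
# Bit i of `a` and `b` holds the two inputs of gate i; bit i of
# `invert_mask` is 1 when gate i's output is complemented (an inverted
# edge in the AIG), so one AND plus one XOR evaluates 32 gates.
def and_invert_32(a: int, b: int, invert_mask: int) -> int:
    return ((a & b) ^ invert_mask) & 0xFFFFFFFF

# Example: gate 0 is a plain AND (mask bit 0 = 0), gate 1 is a NAND
# (mask bit 1 = 1).
out = and_invert_32(0b11, 0b01, 0b10)
# bit 0: (1 AND 1) XOR 0 = 1;  bit 1: (1 AND 0) XOR 1 = 1
```

The point of the trick is that every thread processor runs this identical two-instruction program, which is exactly what a SIMT machine executes well.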
The authors benchmark their solution, GEM, on five different open-source designs ranging in size from 670k to 5.5M gates. Comparing GEM on an NVIDIA A100 to a "commercial" RTL logic simulator running on a single core of an Intel Xeon 6136 (Skylake), GEM runs on average 20x faster on the smallest design down to 2.5x faster on the largest design. Impressive!
Raúl’s view
CPU-based RTL simulators are relatively slow and scale poorly, while FPGA-based emulators are fast but expensive to set up and inflexible. The heterogeneous, irregular nature of digital circuits conflicts with GPUs' SIMT (Single Instruction, Multiple Thread) architecture. GEM (GPU-accelerated emulator) overcomes these challenges, yielding the following results on commodity GPUs: an average 6x speed-up over 8-threaded Verilator and 9x over a leading commercial simulator. The system is open-sourced under Apache 2.0.
GEM's main innovations are the "boomerang executor layer", which handles intra-block efficiency (how logic is executed inside a GPU thread block), and the partitioning flow, which handles inter-block scalability (how a large circuit is divided into many pieces that can be simulated in parallel on thousands of GPU cores).
A common method for mapping logic circuits to a GPU is levelization: divide the circuit into "logic levels" so that gates at the same depth can be computed in parallel. But real circuits have many levels with only a few gates. In GPU kernels running in SIMT fashion, each level would trigger a global synchronization, and most GPU threads would be idle most of the time. The result is poor GPU utilization and large synchronization overhead. Instead, in GEM each GPU thread block (representing one circuit partition) maintains 8,192 bits of circuit state in shared memory, and the boomerang layer executes logic across multiple levels (14 levels) in one pass. It processes these bits in a recursive, folded structure: pairs of bits A and B of the circuit state are repeatedly combined with an external constant C using bitwise logic operations:
r = (A AND B) XOR C
Conceptually, this is like "folding" the bit vector in half multiple times; each fold collapses several logic levels into one operation. The folds are performed in parallel across 32-bit words, adding word-level parallelism on top of thread-level parallelism, and are repeated 14 times until a single resulting bit represents the output of a deep cone of logic. This "boomerang" execution pattern effectively computes the equivalent of 10-15 logic levels, and because all operations happen within a thread block, synchronization is local, avoiding costly global GPU synchronizations. The boomerang shape mirrors how logic density changes across circuit depth: many gates at shallow levels (the wide part of the boomerang), few gates at deeper levels (the narrow part).
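A toy sketch of the folding idea, in Python. This is our illustration under stated assumptions, not GEM's actual kernel: we fold an 8-bit state three times, where GEM folds an 8,192-bit state over 14 levels, and we let the caller supply the per-fold constants C that encode the logic.

```python
# Toy sketch (our reconstruction): fold a packed bit vector in half
# repeatedly, combining the two halves with r = (A AND B) XOR C.
def boomerang_fold(state: int, width: int, masks: list[int]) -> int:
    # `state` packs `width` circuit-state bits; each fold halves the
    # width, collapsing one group of logic levels into word operations.
    for c in masks:
        half = width // 2
        lo = state & ((1 << half) - 1)   # lower half of the state = A
        hi = state >> half               # upper half of the state = B
        state = (lo & hi) ^ (c & ((1 << half) - 1))
        width = half
    return state  # after len(masks) folds, a single output bit remains

# 8 bits of state folded 3 times down to 1 output bit.
result = boomerang_fold(0b10110111, 8, [0b0000, 0b00, 0b0])
```

Each fold is a purely local bitwise operation on the block's shared-memory state, which is why no global synchronization is needed until the partition's outputs are published.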
The partitioning flow deals with two problems: 1) inter-block dependencies (if two circuit partitions depend on each other's outputs within a simulation cycle, global synchronization is needed every time step, killing performance) and 2) replication cost. It builds on a known algorithm, RepCut (Replication-Aided Partitioning), which removes these dependencies by duplicating some logic across partitions so that each partition can simulate independently. But RepCut was designed for tens of CPU threads, not hundreds or thousands of GPU thread blocks; used directly, the amount of duplicated logic grows to over 200% for just 200+ partitions. Instead of cutting the entire circuit into hundreds of partitions in one go, GEM performs multi-stage RepCut: splitting large designs in stages, minimizing replication, aligning partition size to GPU architecture constraints (boomerang width), and merging intelligently to ensure efficient GPU occupancy, at the cost of one additional synchronization point between stages. This reduces replication to under 3% for a 500K-gate circuit partitioned into 216 blocks.
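To see why staging helps, a back-of-the-envelope sketch in Python. This is our illustration, not GEM's algorithm: the 1%-per-stage overhead is a hypothetical figure we chose so that three stages land near the sub-3% total reported in the paper; only the 216-block and <3% numbers come from the source.

```python
# Toy model of multi-stage partitioning cost (our illustration).
def multi_stage_blocks(splits_per_stage: list[float]) -> int:
    """Total partitions from staged splitting, e.g. 6 x 6 x 6 = 216."""
    total = 1
    for k in splits_per_stage:
        total *= int(k)
    return total

def staged_replication(per_stage_overhead: list[float]) -> float:
    """Compound small per-stage duplication overheads multiplicatively."""
    total = 1.0
    for r in per_stage_overhead:
        total *= (1.0 + r)
    return total - 1.0

blocks = multi_stage_blocks([6, 6, 6])          # 216 thread blocks
# Three stages each adding a hypothetical ~1% of duplicated logic
# compound to roughly 3% overall, versus the >200% reported for a
# single flat cut into 200+ partitions.
overall = staged_replication([0.01, 0.01, 0.01])
```

The intuition is that each stage only has to cut a modest number of ways, so each cut duplicates little logic, and the small overheads compound instead of exploding.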
GEM's innovation lies in its emulator-inspired abstraction that maps circuit logic to GPU execution through a VLIW architecture and highly local memory access. The mapping flow borrows from traditional EDA synthesis, placement, and partitioning logic. It achieves high simulation density and GPU efficiency, outperforming multi-threaded Verilator (6x), commercial CPU-based tools (9x) and previous GPU approaches (8x) across design types ranging from RISC-V CPUs to AI accelerators. Keep in mind that this is two-state, bit-level simulation without the bells and whistles of a commercial simulator.
GEM combines EDA methods with GPU computing in a software-only, open-source package compatible with standard GPUs, which makes it appealing. While described as RTL level, it operates at the bit level like FPGA emulators. GEM currently lacks multi-GPU support, 4-state logic, arithmetic modeling, and event-driven pruning, requiring further development to potentially become a competitive simulation alternative.