Harnessing Modular Vector Processing for Scalable, Power-Efficient AI Acceleration

by Jonah McLeod on 02-24-2025 at 6:00 am


The dominance of GPUs in AI workloads has long been driven by their ability to handle massive parallelism, but this advantage comes at the cost of high power consumption and architectural rigidity. A new approach, leveraging a chiplet-based RISC-V vector processor, offers an alternative that balances performance, efficiency, and flexibility, steering towards heterogeneous computing to align with AI/ML-driven workloads. By rethinking vector computation and memory bandwidth management, a scalable AI accelerator could rival NVIDIA’s GPUs in both cloud and edge computing.

A modular chiplet architecture is essential for achieving scalability and efficiency. Instead of a monolithic GPU design, a system composed of specialized chiplets can optimize different aspects of AI computation. A vector processing chiplet, built on the RISC-V vector extension, serves as the primary computational unit, dynamically adjusting vector length to fit workloads of varying complexity. A matrix multiplication accelerator complements this unit, handling the computationally intensive operations found in neural networks.
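To illustrate what "dynamically adjusting vector length" means in practice, here is a minimal Python model of the strip-mining loop that the RISC-V vector extension encourages. It is a sketch, not hardware code: the function names and the VLMAX value of 8 are illustrative assumptions, and the vsetvl model uses the simplest behavior the specification permits (grant the smaller of the requested length and the hardware maximum).

```python
def vsetvl(remaining: int, vlmax: int) -> int:
    """Simplified model of RVV's vsetvli: grant min(requested, VLMAX) elements."""
    return min(remaining, vlmax)


def vector_add(a, b, vlmax=8):
    """Strip-mined elementwise add.

    Returns the result plus the vector length used on each pass, so the
    same loop works unchanged for any hardware vector length (vlmax).
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out, lengths = [], []
    i, n = 0, len(a)
    while i < n:
        vl = vsetvl(n - i, vlmax)          # hardware tells us how many elements to process
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        lengths.append(vl)
        i += vl                            # advance by the granted length, not a fixed width
    return out, lengths


result, lengths = vector_add(list(range(19)), [1] * 19, vlmax=8)
print(lengths)  # [8, 8, 3]
```

The final pass handles the 3-element tail without a separate scalar cleanup loop, which is the property that lets one binary run across implementations with different vector widths.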

Alongside the matrix multiplication accelerator, other specialized chiplets, such as tensor cores, cryptography engines, and AI/ML accelerators, can further enhance efficiency and performance. To address the memory bottlenecks that often slow down AI inference and training, high-bandwidth on-package memory chiplets integrate closely with the compute units, reducing latency and improving data flow. Managing these interactions, a scalar processor chiplet oversees execution scheduling, memory allocation, and communication across the entire system.

One of the fundamental challenges in AI acceleration is mitigating instruction stalls caused by memory latency. Traditional GPUs rely on speculative execution and complex replay mechanisms to handle these delays, but a chiplet-based RISC-V vector processor could take a different approach by implementing time-based execution scheduling. Instructions are pre-scheduled into execution slots, eliminating the need for register renaming and reducing overhead. When a load stalls, an execution-time freezing mechanism pauses the schedule until the data arrives, so the pre-planned slots stay aligned with data availability and the vector units remain fully utilized. This ‘Fire and Forget’ time-based execution scheduling enables parallelism, low power, and minimal overhead while maximizing resource utilization and hiding memory latencies.
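The idea of pre-scheduling instructions into execution slots can be sketched with a small toy model. This is an illustration only, not the patented design: the instruction format, unit names, and latencies are all hypothetical, every unit is assumed to be fully pipelined, and write-after-write and write-after-read hazards are ignored. At dispatch, each instruction is assigned the earliest cycle at which its operands and its functional unit will be available, using a table of register ready times.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    name: str                 # label only, for printing
    dest: Optional[str]       # destination register, or None (e.g. a store)
    srcs: Tuple[str, ...]     # source registers
    unit: str                 # functional unit (hypothetical names)
    latency: int              # cycles until the result is available (assumed values)


def schedule(instrs):
    """Assign each instruction a fixed execution cycle at dispatch time."""
    reg_ready = {}   # register -> cycle at which its value will be available
    unit_free = {}   # functional unit -> first cycle it can accept a new op
    plan = []
    for dispatch_cycle, ins in enumerate(instrs):   # one in-order dispatch per cycle
        operands_ready = max((reg_ready.get(r, 0) for r in ins.srcs), default=0)
        start = max(dispatch_cycle, operands_ready, unit_free.get(ins.unit, 0))
        unit_free[ins.unit] = start + 1             # pipelined: one new op per cycle
        if ins.dest is not None:
            reg_ready[ins.dest] = start + ins.latency
        plan.append((ins.name, start))
    return plan


program = [
    Instr("load v1", "v1", (), "load", 4),
    Instr("load v2", "v2", (), "load", 4),
    Instr("mul v3 = v1 * v2", "v3", ("v1", "v2"), "valu", 3),
    Instr("add v4 = v3 + v1", "v4", ("v3", "v1"), "valu", 2),
    Instr("store v4", None, ("v4",), "store", 1),
]

for name, cycle in schedule(program):
    print(f"cycle {cycle:2d}: {name}")
```

With these assumed latencies the plan places the five instructions at cycles 0, 1, 5, 8, and 10. Because every start time is known at dispatch, nothing has to be woken up or replayed later; a stalled load would instead be handled by freezing the time counter, which this sketch does not model.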

Chiplet communication plays a pivotal role in determining overall system performance. Unlike monolithic GPUs that rely on internal bus architectures, a chiplet-based AI accelerator needs a high-speed interconnect to maintain seamless data transfer. The adoption of UCIe (Universal Chiplet Interconnect Express) could provide an efficient die-to-die communication framework, reducing latency between compute and memory units. An optimized network-on-chip (NoC) further ensures that vector instructions and matrix operations flow efficiently between chiplets, preventing bottlenecks in high-throughput AI workloads.

Competing with NVIDIA’s ecosystem requires more than just hardware innovation. Higher vector unit utilization keeps the vector pipeline fully active, maximizing throughput and eliminating idle cycles. Fewer stalls and pipeline flushes prevent execution misalignment, ensuring smooth and efficient instruction flow. Superior power efficiency reduces unnecessary power consumption by pausing execution only when needed, and optimized instruction scheduling aligns vector execution precisely with data availability, boosting overall performance.

Software plays an equally important role in adoption and usability. A robust compiler stack optimized for RISC-V vector and matrix extensions ensures that AI models can take full advantage of the hardware. Custom libraries tailored for deep learning frameworks such as PyTorch and TensorFlow bridge the gap between application developers and hardware acceleration. A transpilation layer such as CuPBoP (CUDA for Parallelized and Broad-range Processors) enables seamless migration from existing GPU-centric AI infrastructure, lowering the barrier to adoption.

CuPBoP presents a compelling pathway for enabling CUDA workloads on non-NVIDIA architectures. By supporting multiple Instruction Set Architectures (ISAs), including RISC-V, CuPBoP enhances cross-platform flexibility, allowing AI developers to execute CUDA programs without the need for intermediate portable programming languages. Its high CUDA feature coverage makes it a robust alternative to existing transpilation frameworks, ensuring greater compatibility with CUDA-optimized AI workloads. By leveraging CuPBoP, RISC-V developers could bridge the gap between CUDA-native applications and high-performance RISC-V architectures, offering an efficient, open-source alternative to proprietary GPU solutions.

Energy efficiency is another area where a chiplet-based RISC-V accelerator can differentiate itself from power-hungry GPUs. Fine-grained power gating allows inactive compute units to be dynamically powered down, reducing overall energy consumption. Near-memory computing further enhances efficiency by placing computation as close as possible to data storage, minimizing costly data movement. Optimized vector register extensions ensure that AI workloads make the most efficient use of available compute resources, further improving performance-per-watt compared to traditional GPU designs.

Interestingly, while the idea of a RISC-V chiplet-based AI accelerator remains largely unexplored in public discourse, there are signals that the industry is moving in this direction. Companies such as Meta, Google, Intel, and Apple have all made significant investments in RISC-V technology, particularly in AI inference and vector computing. However, most known RISC-V AI solutions, such as those from SiFive, Andes Technology, and Tenstorrent, still rely on monolithic SoCs or multi-core architectures, rather than a truly scalable, chiplet-based approach.

A recent pitch deck from Simplex Micro suggests that a time-based execution model and modular vector processing architecture could dramatically improve AI processing efficiency, particularly in high-performance AI inference workloads. While details on commercial implementations remain sparse, the underlying patent portfolio and architectural insights indicate that the concept is technically feasible. (see table)

Patent #         Patent Title                                               Granted
US-11829762-B2   Time-Resource Matrix for a Microprocessor                  11/28/2023
US-12001848-B2   Phantom Registers for a Time-Based CPU                     11/12/2024
US-11954491-B2   Multi-Threaded Microprocessor with Time-Based Scheduling   4/9/2024
US-12147812-B2   Out-of-Order Execution for Loop Instructions               11/19/2024
US-12124849-B2   Non-Cacheable Memory Load Prediction                       10/22/2024
US-12169716-B2   Time-Based Scheduling for Extended Instructions            12/17/2024
US-11829767-B2   Time-Aware Register Scoreboard                             11/28/2023
US-11829762-B2   Statically Dispatched Time-Based Execution                 11/28/2023
US-12190116-B2   Optimized Instruction Replay System                        1/7/2025

The strategic positioning of such an AI accelerator depends on the target market. Data centers seeking alternatives to proprietary GPU architectures would benefit from a flexible, high-performance RISC-V-based AI solution. Edge AI applications, such as augmented reality, autonomous systems, and industrial IoT, could leverage the power efficiency of a modular vector processor to run AI workloads locally without relying on cloud-based inference. By offering a scalable, customizable solution that adapts to the needs of different AI applications, a chiplet-based RISC-V vector accelerator has the potential to challenge NVIDIA’s dominance.

As AI workloads continue to evolve, the limitations of traditional monolithic architectures become more apparent. A chiplet-based RISC-V vector processor represents a shift toward a more adaptable, energy-efficient, and open-source approach to AI acceleration: it is modular, customizable, scalable, and cost-effective, which makes it well suited to AI, ML, and HPC within an open-source ecosystem. By integrating time-based execution, high-bandwidth interconnects, and workload-specific optimizations, this emerging architecture could pave the way for the next generation of AI hardware, redefining the balance between performance, power, and scalability.



Rethinking Multipatterning for 2nm Node

by Fred Chen on 02-23-2025 at 10:00 am


Whether EUV or DUV doesn’t matter at 20 nm pitch
The International Roadmap for Devices and Systems, 2022 Edition, indicates that the “2nm” node due in 2025 (this year) has a minimum (metal) half-pitch of 10 nm [1]. This is, in fact, less than the resolution of a current state-of-the-art EUV system, with a numerical aperture (NA) of 0.33. Even for a next-generation, high-NA (0.55 NA) EUV system, a 20 nm line pitch can only be fundamentally imaged by the basic interference of two plane waves. As Figure 1 shows, the stochastic behavior is expected to be unmanageable, compared to a similarly imaged 80 nm pitch on a state-of-the-art ArF immersion system (Figure 2).
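The resolution claim can be checked with the standard two-beam interference limit, in which the minimum half-pitch is 0.25 × wavelength / NA. The short sketch below applies it to the systems mentioned. The 13.5 nm EUV wavelength and 193 nm ArF wavelength are standard values; the 1.35 NA for ArF immersion is an assumption of this example and is not stated in the text.

```python
def min_half_pitch_nm(wavelength_nm: float, na: float, k1: float = 0.25) -> float:
    """Smallest half-pitch from k1 * wavelength / NA (k1 = 0.25 is the two-beam limit)."""
    return k1 * wavelength_nm / na


systems = {
    "EUV, 0.33 NA": (13.5, 0.33),
    "EUV, 0.55 NA (high-NA)": (13.5, 0.55),
    "ArF immersion, 1.35 NA (assumed)": (193.0, 1.35),
}

for name, (wavelength, na) in systems.items():
    print(f"{name}: {min_half_pitch_nm(wavelength, na):.1f} nm half-pitch limit")
```

This gives roughly 10.2 nm for 0.33 NA EUV, which is just above the 10 nm target, and about 6.1 nm for 0.55 NA. The high-NA system can therefore resolve a 20 nm pitch, but only close enough to its limit that two-beam imaging is the sole option. The ArF immersion limit comes out near 35.7 nm under the assumed NA.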

Figure 1. The stochastic appearance (i.e., scattered electron density) of a 10 nm half-pitch image is made worse with 3 nm blur, expected in metal oxide resists [2]. A 20 mJ/cm² absorbed dose is assumed. The dipole-induced fading is modeled as a +1 nm or -1 nm image shift for the image produced by each of the two poles.

Figure 2. The stochastic appearance (i.e., absorbed photon density) of a 40 nm half-pitch image with ArF dipole illumination is negligible compared to the EUV case of Figure 1, even with a 2 mJ/cm² absorbed dose assumed. A 6% attenuated phase-shift mask is assumed to be used for negative-tone imaging.
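A rough photon count helps explain why the two figures differ so much. The sketch below converts the absorbed doses quoted in the captions into absorbed photons per square nanometer, then into photons per half-pitch-sized square. It only counts photons and applies simple Poisson statistics; it does not model the electron scattering and blur that Figure 1 includes, so treat it as a back-of-the-envelope estimate rather than a reproduction of the figures.

```python
import math

HC_EV_NM = 1239.841984       # Planck constant times speed of light, in eV*nm
EV_TO_J = 1.602176634e-19    # joules per electronvolt
NM2_PER_CM2 = 1e14           # square nanometers in one square centimeter


def photons_per_nm2(dose_mj_cm2: float, wavelength_nm: float) -> float:
    """Absorbed photons per nm^2 for a given absorbed dose and wavelength."""
    photon_energy_j = HC_EV_NM / wavelength_nm * EV_TO_J
    dose_j_per_nm2 = dose_mj_cm2 * 1e-3 / NM2_PER_CM2
    return dose_j_per_nm2 / photon_energy_j


cases = {
    "EUV, 20 mJ/cm2, 10 nm half-pitch": (20.0, 13.5, 10.0),
    "ArF, 2 mJ/cm2, 40 nm half-pitch": (2.0, 193.0, 40.0),
}

for name, (dose, wavelength, half_pitch) in cases.items():
    density = photons_per_nm2(dose, wavelength)
    per_feature = density * half_pitch ** 2       # photons in a half-pitch x half-pitch square
    noise = 1.0 / math.sqrt(per_feature)          # relative Poisson (shot) noise
    print(f"{name}: {density:.1f} photons/nm^2, relative noise {noise:.1%}")
```

Even at one-tenth the dose, ArF delivers more photons per unit area because each photon carries about 14 times less energy, and the 40 nm feature collects them over 16 times the area.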

Consequently, double patterning is unavoidable even with EUV lithography. However, any double patterning scheme in EUV lithography for the 2nm node still requires imaging a 10 nm linewidth, e.g., for the cell with four routing tracks and two wide rails (in keeping with TSMC N2 without backside power delivery) shown in Figure 3. Therefore, from what we saw in Figure 1, we still expect line-edge and linewidth definition to be challenging for feature sizes of ~10 nm.

Figure 3. A 6-track cell with 10 nm half-pitch features (left) can be formed using double patterning, but a 10 nm linewidth still needs to be formed as the core (right). Note: each square represents 10 nm. The red areas are gaps which are filled after the spacers are formed.

Thus, we expect that the linewidth used in double patterning will itself not be defined by direct exposure but instead by using another double patterning, specifically, self-aligned double patterning (SADP). SADP involves depositing a spacer over the mandrels, etching back to leave only the sidewalls covered, then removing the mandrels. This doubles the feature density, as there are two sidewall spacers per mandrel (Figure 4).
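The density doubling can be seen in a one-dimensional geometric sketch of the SADP steps just described. The dimensions are illustrative assumptions chosen to match the 10 nm features discussed here: mandrels 10 nm wide on a 40 nm pitch, with 10 nm spacers. Each mandrel leaves one spacer on each sidewall, and the mandrel itself is then removed.

```python
def sadp_spacers(mandrel_lefts, mandrel_width, spacer_width):
    """Return (left, right) edges of the spacers left after mandrel removal."""
    spacers = []
    for left in mandrel_lefts:
        right = left + mandrel_width
        spacers.append((left - spacer_width, left))     # spacer on the left sidewall
        spacers.append((right, right + spacer_width))   # spacer on the right sidewall
    return sorted(spacers)


mandrel_pitch = 40                                        # nm, assumed exposure pitch
mandrels = [i * mandrel_pitch for i in range(3)]          # mandrel left edges at 0, 40, 80 nm
spacers = sadp_spacers(mandrels, mandrel_width=10, spacer_width=10)

pitches = [b[0] - a[0] for a, b in zip(spacers, spacers[1:])]
print(spacers)
print(pitches)  # [20, 20, 20, 20, 20]
```

Three mandrels on a 40 nm pitch yield six spacer lines on a uniform 20 nm pitch. The pitch is only uniform when the mandrel width plus the spacer width equals half the mandrel pitch, which is why the mandrel linewidth has to be controlled so tightly.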

Figure 4. Self-Aligned Double Patterning (SADP) doubles feature density by using spacers [3].

In 2021 (prior to 3nm production start), TSMC hinted at this approach in its disclosure in US patent application 20210232747 [4]:

“A method includes forming a first mandrel pattern and a second mandrel pattern. The first mandrel pattern includes at least first and second mandrels for a mandrel-spacer double patterning process. The second mandrel pattern includes at least a third mandrel inserted between the first and second mandrels. The first mandrel pattern and the second mandrel pattern include a same material. The first and second mandrels are merged together with the third mandrel to form a single pattern.”

This is essentially the approach known as LELE-SADP. LELE refers to “Litho-Etch-Litho-Etch”, which would lead to the formation of the two separate mandrel patterns. These mandrel patterns, in combination, act as the base pattern, or core pattern, for SADP.

Figure 5. LELE is used to generate the black core pattern of Figure 3 (right). The two different colors indicate the two different exposures.

Some of the core pattern linewidths shown in Figure 5 are still too small to be printed directly, so they need to be trimmed from a larger exposed linewidth (Figure 6).

Figure 6. A larger linewidth (left) is trimmed down to give the target 10 nm linewidths (right).

Note that trimming cannot be used to get the core pattern in Figure 3, since the exposed 10 nm gaps would then be too narrow (Figure 7).

Figure 7. Trimming is not feasible here, since the starting 10 nm gaps here (left) are too narrow.

Thus, we see that LELE-SADP is the only option for producing the 6-track cell with four routing tracks and two wide rails, even with EUV. The clincher is that DUV can, in fact, produce the exact same 10 nm minimum half-pitch dimensions, with starting exposure pitches of 480 nm. This allows a substantial reduction of the cost associated with EUV use.

Beyond 2nm

Backside power delivery at the 2nm node and beyond will place the rails at a different layer than the metal routing. This could improve the multipatterning logistics by putting the wide rails and narrow tracks on different layers below and above the transistors, respectively. Thus, a regular grid of minimum pitch lines will suffice for the routing tracks. At 16-18 nm pitch, EUV would be implementing Self-Aligned Quadruple Patterning (SAQP), which would be SADP applied twice successively. DUV would be implementing Self-Aligned Sextuple Patterning (SASP), which would be SADP immediately followed by SATP (Self-Aligned Triple Patterning) [5]. Both EUV SAQP and DUV SASP only require one mask exposure, which will be an improvement over the two masks for LELE-SADP. It is worth noting that SASP takes the resolution of ArF immersion lithography from 38 nm half-pitch down to one-sixth of that, or 6.3 nm half-pitch.
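The pitch arithmetic in this paragraph follows from multiplying the division factors of the successive self-aligned steps. The small sketch below makes that explicit, using only the factors named above (SADP divides pitch by 2, SATP by 3); the function names are illustrative.

```python
# Pitch division factor of each self-aligned scheme, built from its component steps.
SADP = 2
SATP = 3
FACTORS = {
    "SADP": SADP,
    "SAQP": SADP * SADP,   # SADP applied twice
    "SASP": SADP * SATP,   # SADP followed by SATP
}


def exposure_pitch_nm(target_pitch_nm: float, scheme: str) -> float:
    """Pitch that must be printed in the single exposure to reach the target pitch."""
    return target_pitch_nm * FACTORS[scheme]


def final_half_pitch_nm(exposed_half_pitch_nm: float, scheme: str) -> float:
    """Half-pitch obtained after the self-aligned steps."""
    return exposed_half_pitch_nm / FACTORS[scheme]


for target in (16, 18):
    print(f"{target} nm pitch: EUV SAQP exposes {exposure_pitch_nm(target, 'SAQP'):.0f} nm pitch, "
          f"DUV SASP exposes {exposure_pitch_nm(target, 'SASP'):.0f} nm pitch")

print(f"ArF immersion 38 nm half-pitch with SASP: {final_half_pitch_nm(38, 'SASP'):.1f} nm")
```

For the 16-18 nm target, EUV SAQP would expose a 64-72 nm pitch and DUV SASP a 96-108 nm pitch, that is, a 48-54 nm half-pitch, which sits comfortably above the 38 nm ArF immersion figure quoted above.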

References

[1] https://irds.ieee.org/editions/2022/irds%E2%84%A2-2022-lithography

[2] Z. Belete et al., J. Micro/Nanopattern. Mater. Metrol. 20, 014801 (2021); L. F. Miguez et al., Proc. SPIE 12498, 124980E (2023).

[3] US Patent 5328810, originally assigned to Micron, now expired. https://en.wikipedia.org/wiki/Multiple_patterning#/media/File:Spacer_Patterning.JPG. Creative Commons license CC BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/

[4] Now US Patent 11748540, assigned to TSMC, expires 2035. The scope has been limited to where the third mandrel is shorter than the first and second mandrels.

[5] US Patent 7842601, assigned to Samsung, expires 2029.
