
Tiling Support in SiFive’s AI/ML Software Stack for RISC-V Vector-Matrix Extension

by Daniel Nenni on 12-31-2025 at 10:00 am


At the 2025 RISC-V Summit North America, Min Hsu, Staff Compiler Engineer at SiFive, presented on enhancing tiling support within SiFive’s AI/ML software stack for the RISC-V Vector-Matrix Extension (VME). This extension aims to boost matrix multiplication efficiency, a cornerstone of AI workloads. SiFive’s VME implementation introduces a large matrix accumulator state for the result matrix C, leveraging existing RISC-V Vector (RVV) registers to supply source operands A and B. This design enables outer-product-style multiplications directly into the C accumulator, with options for “fat” k>1 support to handle narrower input datatypes. Rows or columns of C can be moved to vector registers or loaded/stored from memory, and the C state may be segmented into multiple tiles. By positioning the accumulator near arithmetic units, the matrix engine achieves high throughput, making it ideal for compute-intensive AI tasks.
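The outer-product accumulation style can be illustrated in NumPy. This is a functional sketch of the dataflow, not the ISA itself; the function name and the k_fat parameter are illustrative stand-ins for the "fat" k>1 option described in the talk.

```python
import numpy as np

def vme_outer_product_matmul(A, B, k_fat=1):
    """Sketch of VME-style dataflow: C lives in a matrix accumulator,
    while each step streams a slice of A and B from vector registers
    and applies a rank-k_fat (outer-product style) update to all of C."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))                      # accumulator state near the ALUs
    for k1 in range(0, k, k_fat):
        # one matrix instruction: rank-k_fat update of the full C tile
        C += A[:, k1:k1 + k_fat] @ B[k1:k1 + k_fat, :]
    return C
```

With k_fat=1 each step is a pure outer product; larger k_fat models packing several narrow-datatype elements per register lane.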

A key focus was tiled matrix multiplication, illustrated through a Python pseudocode example. The function tiled_matmul decomposes large matrices A (m x k), B (k x n), and C (m x n) into manageable tiles. Outer loops iterate over tile_m, tile_n, and tile_k dimensions, creating views of sub-matrices (e.g., lhs_tile = A[m1:m1+tile_m, k1:k1+tile_k]). Inner loops then apply register-level tiling with tile_m_v, tile_n_v, and tile_k_v, performing the core operation: dst_tile[mv:mv+tile_m_v, nv:nv+tile_n_v] += np.matmul(lhs_tile_v, rhs_tile_v). This hierarchical tiling optimizes data locality—outer tiles fit into caches, inner ones into registers—reducing memory access overhead and enhancing performance for large-scale AI models.
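A runnable reconstruction of that pseudocode is below. It is consistent with the description of Hsu's example (variable names like lhs_tile and tile_m_v come from the talk), but the exact code is my own; for brevity it assumes tile sizes evenly divide the matrix and tile dimensions, whereas real generated code handles remainders.

```python
import numpy as np

def tiled_matmul(A, B, C, tile_m, tile_n, tile_k,
                 tile_m_v, tile_n_v, tile_k_v):
    """Two-level tiled C += A @ B: outer tiles target caches,
    inner (register-level) tiles target vector/matrix registers."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and C.shape == (m, n)
    for m1 in range(0, m, tile_m):                  # cache-level tiling
        for n1 in range(0, n, tile_n):
            for k1 in range(0, k, tile_k):
                lhs_tile = A[m1:m1+tile_m, k1:k1+tile_k]
                rhs_tile = B[k1:k1+tile_k, n1:n1+tile_n]
                dst_tile = C[m1:m1+tile_m, n1:n1+tile_n]   # view into C
                for mv in range(0, tile_m, tile_m_v):      # register-level tiling
                    for nv in range(0, tile_n, tile_n_v):
                        for kv in range(0, tile_k, tile_k_v):
                            lhs_tile_v = lhs_tile[mv:mv+tile_m_v, kv:kv+tile_k_v]
                            rhs_tile_v = rhs_tile[kv:kv+tile_k_v, nv:nv+tile_n_v]
                            dst_tile[mv:mv+tile_m_v, nv:nv+tile_n_v] += \
                                np.matmul(lhs_tile_v, rhs_tile_v)
    return C
```

Because dst_tile is a NumPy view, the innermost accumulation updates C in place, mirroring how the hardware accumulator is updated across the K loop.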

SiFive’s AI/ML software stack integrates these hardware features seamlessly, enabling end-to-end execution of high-profile models on SiFive platforms. Central to this is the Intermediate Representation Execution Environment (IREE), an open-source MLIR-based compiler and runtime optimized for SiFive microarchitectures. IREE supports diverse front-ends like PyTorch for LLMs, applying target-specific tiling policies to break down operations. It enables intra-operation parallelization, generates code via SiFive’s tuned LLVM compilers and Scalable Kernel Libraries (SKL), and mixes MLIR codegen with microkernels (ukernels) for efficiency. The runtime handles inter-operation parallelization through asynchronous execution and task scheduling, supporting both Linux and bare-metal environments.

Hsu highlighted advancements in multi-tile matrix multiplication within IREE. Previously, IREE supported only single-tile K-loops, where sources A0 and B0 are loaded once, and a single matmul accumulates into C00. Now, enhancements allow multi-tile K-loops, loading sources like A0, A1 once and distributing accumulations across multiple C tiles (e.g., C00 += A0 * B0, C10 += A1 * B0, then C01 += A0 * B1, C11 += A1 * B1). This reduces redundant loads, improving arithmetic intensity and efficiency, especially for deep neural networks where K dimensions are large.
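The load-count benefit can be made concrete with a small NumPy model. This is my own sketch, not IREE internals: the loads counter simply tallies tile loads, showing that the 2x2 multi-tile scheme performs one load per matmul versus two for a single-tile K-loop.

```python
import numpy as np

def k_loop_2x2(A_tiles, B_tiles):
    """Multi-tile K-loop: per K step, load A0, A1, B0, B1 once (4 loads)
    and reuse them across 4 accumulations -- 1 load per matmul, versus
    2 loads per matmul when each C tile runs its own single-tile K-loop.
    A_tiles[k] and B_tiles[k] are register-sized blocks along K."""
    t = A_tiles[0][0].shape[0]
    C = [[np.zeros((t, t)) for _ in range(2)] for _ in range(2)]
    loads = 0
    for k in range(len(A_tiles)):
        A0, A1 = A_tiles[k]                    # loaded once per K step...
        B0, B1 = B_tiles[k]
        loads += 4
        for i, Ai in enumerate((A0, A1)):      # ...reused across 4 updates
            for j, Bj in enumerate((B0, B1)):
                C[i][j] += Ai @ Bj             # C00, C01, C10, C11
    return C, loads
```

For K steps, this issues 4*K loads where the single-tile scheme would issue 8*K, doubling arithmetic intensity for the same math.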

In takeaways, Hsu emphasized that tiled matrix multiplication is essential for high-performance AI/ML applications, as it maximizes hardware utilization. IREE excels in automating and optimizing these tiling strategies. RISC-V’s VME is purpose-built for such tiled operations, delivering native performance gains. SiFive’s XM series implements VME in a compact, integrated form factor, and the team’s contributions to IREE—particularly multi-tile support—further amplify efficiency. This software-hardware synergy positions SiFive’s stack as a robust solution for AI acceleration on RISC-V, bridging custom extensions with standardized ecosystems to drive innovation in edge and datacenter AI.

Bottom line: The presentation underscores SiFive’s commitment to advancing RISC-V for AI, combining architectural extensions with sophisticated compiler tools to tackle compute bottlenecks effectively.

Also Read:

SiFive Launches Second-Generation Intelligence Family of RISC-V Cores

Podcast EP197: A Tour of the RISC-V Movement and SiFive’s Contributions with Jack Kang

Enhancing RISC-V Vector Extensions to Accelerate Performance on ML Workloads


TSMC based 3D Chips: Socionext Achieves Two Successful Tape-Outs in Just Seven Months!

by Daniel Nenni on 12-31-2025 at 6:00 am


Socionext has reached a noteworthy industry milestone: two successful 3D-IC tape-outs in just seven months for complex, multi-die designs aimed at AI and HPC workloads. That pace of iteration highlights how advanced packaging, richer EDA toolchains, and closer foundry-ecosystem collaboration are turning what used to be multi-year projects into achievable, repeatable engineering cycles.

At the heart of this acceleration are three interlocking trends: face-to-face 3D stacking that shrinks inter-die latency, process-node specialization across dies (e.g., TSMC N3 compute plus TSMC N5 I/O), and EDA/IP/cloud toolchains purpose-built for multi-die flows. Socionext’s taped-out designs reportedly combine an N3 compute die with an N5 I/O die using TSMC’s SoIC-X 3D stacking, a configuration that reduces interconnect distance and power while increasing bandwidth versus traditional 2D or 2.5D approaches.

Speeding a 3D-IC from concept to tape-out requires more than just clever floorplanning. Mechanical and thermal challenges (warpage, delamination, and heat removal), stringent reliability checks, and new timing/IR signoff flows make multi-die design complex. Socionext’s achievement illustrates how tightly integrated IP (PHYs, SerDes), 3D-aware design rules, and cloud-enabled EDA can remove bottlenecks: by automating design-rule checks for stacked interfaces, enabling distributed compute for large signoff runs, and providing pre-verified IP blocks that support high-speed interconnects. The company itself and partners emphasize that combining proven IP with AI-augmented EDA flows shortened development cycles and improved first-pass quality.

From a product perspective, 3D stacking supports an attractive value proposition for AI and HPC: put logic where it matters, optimize each die on the best process node for that function, and connect them with ultra-dense interfaces to reach system-level PPA (power, performance, area) that 2D designs cannot match. For vendors like Socionext — which target consumer SoCs as well as data-center accelerators — the ability to deliver working 3D-ICs rapidly opens new architectural options (heterogeneous dies, separable I/O fabrics, and modular chiplet ecosystems). Recent Socionext materials also show the company expanding 3DIC and 5.5D packaging support and promoting configurable chiplet building blocks to simplify system assembly.

Industry partnerships are central to this story. Socionext’s work with EDA and IP suppliers, and collaboration within the TSMC OIP ecosystem, demonstrate that 3D-IC success depends on an end-to-end supply chain: foundry stacking capabilities, packaging houses that can handle F2F and 5.5D substrates, EDA tools that understand multi-die timing and thermal behavior, and IP that is 3D-aware. The Synopsys writeup covering Socionext’s timeline explicitly credits the use of Synopsys’ 3D-enabled IP, AI-powered EDA flows, and cloud solutions as instrumental in hitting multiple tape-outs quickly.

What does this mean for the broader market? Faster, repeatable 3D tape-outs lower the barrier to entry for companies wanting to pursue heterogeneous integration. They also pressure incumbents to adopt modular approaches and to invest in multi-die verification and manufacturing readiness. However, scaling from tape-out to high-yield mass production remains the next big hurdle: yields, test strategies, and supply-chain throughput for advanced packaging will determine whether such rapid tape-out cycles translate into volume shipments and cost-effective products.

Bottom line: Socionext’s two tape-outs in seven months are more than a marketing sound bite; they’re a signal that the multi-die era is maturing. With the right mix of IP, EDA, foundry packaging, and ecosystem collaboration, complex 3D systems can move from experimental demos to production-grade devices on timelines that were hard to imagine just a few years ago.

Also Read:

Cerebras AI Inference Wins Demo of the Year Award at TSMC North America Technology Symposium

TSMC Kumamoto: Pioneering Japan’s Semiconductor Revival

AI-Driven DRC Productivity Optimization: Revolutionizing Semiconductor Design


RISC-V Extensions for AI: Enhancing Performance in Machine Learning

by Daniel Nenni on 12-30-2025 at 10:00 am


In a presentation at the RISC-V Summit North America 2025, John Simpson, Senior Principal Architect at SiFive, delved into the evolving landscape of RISC-V extensions tailored for artificial intelligence and machine learning. RISC-V’s open architecture has fueled its adoption in AI/ML markets by allowing customization and extension of core designs. However, Simpson emphasized the importance of balancing this flexibility with standardization under profiles like RVA23 to foster an open ecosystem that promotes innovation while preserving differentiation. As AI models grow exponentially (Epoch AI data shows model sizes surging, shifting workloads from vector compute to massive matrix operations), the need for accelerated matrix multiplication and broader datatype support has become critical. Different application domains necessitate varied ISA approaches, but with only a handful of matrix multiply routines, software portability remains relatively unaffected by these choices.

Central to RISC-V’s AI capabilities is the Vector Extension (RVV), which addresses computations beyond matrix multiplies, such as those in activation functions like LayerNorm, Softmax, Sigmoid, and GELU. These operations, involving exponentials and normalizations, can bottleneck throughput when matrix multiplies are accelerated. For instance, prefilling Llama-3 70B with 1k tokens requires 5.12 billion exponential operations. RVV 1.0 supports integer (INT8/16/32/64) and floating-point (FP16/32/64) datatypes, with extensions like Zvfbmin for BF16 conversions and Zvfbwma for widening BF16 multiply-adds. Proposed additions, such as Zvfbta for BF16 arithmetic and Zvfofp8min for OCP FP8 (E4M3/E5M2) via conversions, aim to expand support. Discussions focus on using an altfmt bit in the vtype CSR to encode new datatypes efficiently, avoiding instruction length expansions. Future activity may include OCP MX formats like FP8/6/4, potentially requiring more instruction space or vtype bits.

Simpson outlined several matrix extension approaches under consideration by RISC-V task groups. The Zvbdot extension introduces vector batch dot-products without new state, leveraging existing vector registers. It computes eight dot-products per instruction, with one input from vector A and eight from group B (columns as registers), accumulating in group C. A 3-bit offset accesses up to 64 results. For VLEN=1024 with FP8 inputs and FP32 outputs, it achieves 1K MACs per instruction while writing only 256 bits, accelerating GEMM and GEMV with a vector-friendly read-heavy design.
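The described semantics can be modeled functionally in NumPy. This is an illustration of the behavior as presented in the talk, not the draft specification; the function name and argument layout are my own.

```python
import numpy as np

def zvbdot_model(c_group, offset, a, b_group):
    """Functional model of a vector batch dot-product: eight dot-products
    between one source vector `a` and eight register 'columns' in
    `b_group`, accumulated into eight slots of accumulator group
    `c_group` selected by a 3-bit offset (so up to 64 results)."""
    assert 0 <= offset < 8 and b_group.shape == (8, a.shape[0])
    c = c_group.copy()
    c[offset*8:(offset+1)*8] += b_group @ a   # eight reduced MAC chains
    return c
```

At VLEN=1024 with FP8 inputs, `a` holds 128 elements, so the eight dot-products perform 8 x 128 = 1K MACs while writing only eight FP32 results (256 bits), matching the figures Simpson quoted.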

Integrated Matrix Extensions (IME TG) reuse vector registers as matrix tiles, adding minimal vtype bits. They support matrix-matrix multiplies, with higher arithmetic intensity from longer vectors. Most sub-proposals require new tile load/store instructions, and Option-G is advancing. Write demands for result C might necessitate register renaming in the matrix unit, transparent to software.

Vector-Matrix Extensions (VME TG) add large matrix accumulator state for C, divided into tiles, while using RVV vectors for A and B. Outer-product multiplies accumulate into C, with potential “fat” support for narrower inputs. It includes moves between C and vectors/memory, enabling high throughput by placing accumulators near arithmetic units.

Attached Matrix Extensions (AME TG) introduce separate state for A, B, and C, performing matrix-matrix multiplies independently of RVV. If RVV is absent, new vector operations on matrix state are needed; otherwise, integration is preferred. Requiring dedicated load/store paths, AME offers the largest design space for peak performance, though no consensus proposal exists yet.

Performance varies by approach: Zvbdot suits LLM decode phases with batch=1, accelerating GEMV. IME fits edge devices prioritizing area/power. VME balances vector sourcing with high MACs, while AME maximizes MACs but demands more resources. For LLMs, larger batches improve efficiency but strain KV cache bandwidth.

Bottom line: These extensions position RISC-V as a versatile AI platform, evolving to meet diverse needs from edge to hyperscale. SiFive’s insights highlight ongoing standardization efforts to ensure scalability and ecosystem growth.

Also Read:

SiFive Launches Second-Generation Intelligence Family of RISC-V Cores

Podcast EP197: A Tour of the RISC-V Movement and SiFive’s Contributions with Jack Kang

Enhancing RISC-V Vector Extensions to Accelerate Performance on ML Workloads


Runtime Elaboration of UVM Verification Code

by Tom Anderson on 12-30-2025 at 6:00 am


Recently, I reported on my conversation with Cristian Amitroaie, CEO of AMIQ EDA, about automated generation of documentation from design and verification code. Before we chose that topic for a post, Cristian described several capabilities of the AMIQ EDA product family that might be of interest to design and verification engineers. For today’s post, I’ve selected runtime elaboration of Universal Verification Methodology (UVM) code because I wanted to know more about the benefits for engineers working on real-world chip projects.

What do you mean by elaboration?

When our tools read in design and verification code, we check for a wide variety of errors, and then we build a complex internal model that reflects every aspect of the code. For example, in our Design and Verification Tools (DVT) Integrated Development Environment (IDE) family, we perform a full design elaboration. That means we build a model with the complete design hierarchy: all parameters computed, generate blocks expanded, binds resolved, and so on. This allows design engineers to explore design hierarchies, trace signals and parameters, draw schematic diagrams, and perform many other useful tasks.

How do you handle verification code?

We also build a complete model for verification environments, which are usually based on UVM. Verification engineers often partially mirror the design hierarchy with a tree of UVM testbench components such as drivers and monitors. They also define and instantiate verification-specific components such as scoreboards and sequencers. All the components are connected together using transaction-level modeling (TLM) ports, defining a verification topology.

Is the verification topology like the design hierarchy?

In some ways yes, but verification topologies are not defined in a static manner like design hierarchies. There is no top module instantiating submodules, and so on, that can be statically computed. The verification topology is controlled per UVM test by activating or deactivating drivers, replacing some components with others tuned to match specific test requirements, connecting specific components to specific design interfaces, etc. The UVM verification component hierarchy is constructed by executing a specific UVM flow at simulation time 0. During this execution, all configuration via the “config db” setters/getters mechanism is performed, all the factory overrides are applied, and more.

What does this mean for DVT IDE?

The bottom line is that verification elaboration cannot be completed until UVM phase 0 (activity at time 0) is executed. We could have called a third-party simulator for this execution, but that takes time and adds overhead. Instead, DVT IDE actually performs a “run 0” internally to allow all the UVM elaboration to happen. We call this process UVM runtime elaboration to reflect its non-static nature.

How does this work in DVT IDE?

Users can ask for the runtime elaboration of a specific UVM test and use breakpoints to debug the “run 0” execution. When a breakpoint interrupts the execution, users can browse the call stacks on each parallel thread and inspect variables. We provide different types of breakpoints, which can be conditional. Users can browse the function call stack and all the breakpoints they’ve set in their project. They can also step through the executed code and inspect variable values, add log points to print information without altering the verification code, and add watchpoints to interrupt upon variable changes.

During UVM runtime elaboration, DVT IDE collects information about factory override definitions and if/where they are applied; information about the config database, including set/get calls and how they are paired; information about the register model, including address and bitfield computation; information about which physical interfaces are connected to virtual interfaces; and information about TLM port connections.

How does this help engineers create, explore, and debug the verification topology?

All this information collected is available in DVT IDE to help engineers explore their verification topology, the tree of components, the register model, the config db, and more. DVT IDE can also display a diagram of all the nested components, including their connections via TLM ports and their connections to the design via virtual interfaces. This is called the UVM Components Diagram.

We can determine some of this verification topology statically, but runtime elaboration allows us to compute actual data that perfectly matches what would happen in a simulator at time 0. Users get all the benefits I’ve mentioned without having to access a simulator. This saves time since the internal UVM runtime elaboration is faster than invoking an external tool that builds a model for full simulation.

What other capabilities benefit the users?

Three things spring to mind. First of all, many verification environments use C models in addition to UVM SystemVerilog code. We support DPI-C calls during “run 0” so this is not an issue. Second, if the verification code changes, users don’t have to go through the compilation and design elaboration process all over again. DVT IDE incrementally analyzes the changes and performs the UVM runtime elaboration. Finally, after the elaboration is done, we save a database that users can load anytime. This means that if there are no changes to the UVM topology, verification engineers can simply load the snapshot without having to execute runtime elaboration again.

Any final thoughts?

The capabilities I’ve listed are robust and well proven by many users over several years. In this post, I’ve only given an overview. To find out more, I recommend a concise tutorial available on our website. Of course, interested verification engineers can contact us to schedule a demo or request an evaluation license.

Thank you for your time, Cristian.

Likewise, and Happy Holidays!

Also Read:

Better Automatic Generation of Documentation from RTL Code

AMIQ EDA at the 2025 Design Automation Conference #62DAC

2025 Outlook with Cristian Amitroaie, Founder and CEO of AMIQ EDA


CISCO ASIC Success with Synopsys SLM IPs

by Daniel Nenni on 12-29-2025 at 10:00 am


Cisco’s relentless push toward higher-performance networking silicon has placed extraordinary demands on its ASIC design methodology. As transistor densities continue to rise across advanced SoCs, traditional design-time guardbands are no longer sufficient to ensure long-term reliability, consistent performance, and efficient power consumption. Instead, these chips require deep, real-time observability throughout the operational lifecycle. The challenge is addressed through Cisco’s adoption of Synopsys Silicon Lifecycle Management (SLM) IPs. The company’s latest Silicon One ASICs integrate a broad set of embedded monitors and analytics capabilities that collectively redefine what in-silicon visibility looks like.

Modern networking ASICs operate under highly dynamic conditions. Voltage and temperature fluctuate constantly inside dense logic blocks, and variations in process corners across a single die can influence timing behavior in subtle but meaningful ways. Cisco faces additional pressure because its chips target mission-critical infrastructure where uptime, predictability, and performance efficiency are paramount. According to the success story, transistor aging, exacerbated by thermal and voltage cycling, can reduce timing slack over time, making continuous monitoring essential to safeguard performance margins.

To address these challenges, Cisco deployed a comprehensive suite of Synopsys SLM IPs across its newest ASIC platforms. At the center of this strategy is the Process, Voltage, and Temperature Monitor (PVT) subsystem, orchestrated by the PVT Controller (PVTC). The PVTC aggregates data from multiple distributed sensors, enabling a unified view of environmental and process states across the chip. With this real-time data, the system can support dynamic voltage and frequency scaling, optimizing power and performance based on immediate conditions rather than worst-case assumptions.

Several sensor types feed into this controller. The Process Detector identifies variations across silicon regions, helping Cisco tune performance and understand die-to-die differences. Voltage Monitors track fluctuations in supply rails, ensuring critical blocks operate within safe thresholds. Distributed Temperature Sensors and thermal diodes provide granular thermal maps, improving both thermal management and temperature-dependent calibration. Collectively, these sensors give unprecedented visibility into what is happening inside every major functional quadrant of the ASIC.

Beyond PVT data, Cisco uses the Path Margin Monitor to watch critical timing paths in real time. Instead of relying solely on static timing analysis or margin-heavy design, PMM enables early detection of timing degradation due to aging or unexpected workload conditions. Meanwhile, the Clock Delay Monitor focuses on SRAM behavior, measuring access times and ensuring that memory blocks meet their intended timing specifications during actual operation.

The results are substantial. Cisco has achieved significantly enhanced real-time observability across its ASIC designs, enabling dynamic optimization of power and performance rather than fixed guard-banding. The continuous monitoring of path margins and aging allows proactive reliability management, helping extend the usable lifespan of the silicon. The insights generated not only improve today’s chips but also feed back into future design cycles, refining models and guiding architectural decisions. The modular nature of Synopsys SLM IPs also ensures Cisco can tailor sensor density and placement to each ASIC’s unique requirements, balancing efficiency with coverage.

Bottom line: Cisco plans to leverage Synopsys Silicon.da analytics to mine the vast data produced under diverse operating conditions. This data-driven feedback loop positions Cisco to continue advancing high-performance networking silicon while reducing risk and improving consistency across its product lines. Through its collaboration with Synopsys, Cisco has established a new benchmark for ASIC observability, reliability, and lifecycle optimization in the networking domain.

https://www.synopsys.com/success-stories/cisco-enhances-asic-slm.html
Also Read:

How PCIe Multistream Architecture Enables AI Connectivity at 64 GT/s and 128 GT/s

WEBINAR: How PCIe Multistream Architecture is Enabling AI Connectivity

Lessons from the DeepChip Wars: What a Decade-old Debate Teaches Us About Tech Evolution


RISC-V: Powering the Era of Intelligent General Computing

by Daniel Nenni on 12-29-2025 at 8:00 am


Charlie Su, President and CTO of Andes Technology, delivered a compelling keynote at the 2025 RISC-V Summit North America, asserting that RISC-V is primed to drive the burgeoning field of Intelligent General Computing. This emerging paradigm integrates AI and machine learning into everyday computing devices, from AI-enabled PCs and smartphones to edge servers, software-defined vehicles, and robotic platforms. Su emphasized that advancements in AI/ML are infusing intelligence into general-purpose computing, enabling applications in personal use, factory automation, surveillance, drones, and autonomous driving (ADAS Levels 0-4). He predicted that robots, as app-enabled platforms, could surpass the smartphone market in scale. To support this, Intelligent General Computing demands a robust ecosystem for both general-purpose tasks and large-scale AI/ML, encompassing software and hardware.

Charlie highlighted RISC-V’s role in fostering innovations for large-scale AI/ML. A prime example is Meta’s Training and Inference Accelerator (MTIA), which leverages Andes’ vector and scalar cores alongside the Automated Custom Extension (ACE) framework, as detailed in ISCA 2023. Two generations of MTIA have been deployed in Meta’s data centers since 2023, based on RISC-V processors with automated extensions. Other accelerators using SRAM-based Compute-In-Memory include solutions for servers (e.g., RiVos AI SoC), cloud services (SAPEON), photonics-based AI, and ADAS systems. These are powered by Andes cores like AX46MPV, AX45MPV, NX27V, and AX65, demonstrating RISC-V’s versatility in high-performance AI.

The RISC-V software ecosystem is maturing rapidly, bolstered by initiatives like RISE (RISC-V Software Ecosystem), which accelerates open-source software development, improves quality, and aligns efforts for cloud and IoT devices. Java 22/21 support is already in place, with tools spanning compilers (LLVM, GCC, GLIBC), system libraries (FFmpeg, OpenBLAS), kernel/virtualization (Linux, Android, Performance Profiles), and more. Premier members include Andes, Google, Intel, NVIDIA, Qualcomm, and Samsung. Debian’s open-source support underscores this maturity, with RISC-V achieving a 98.4% successful build rate across over 64,000 packages—ranking third overall. Metanoia’s 5G O-RAN software architecture further exemplifies modular, full open-source releases for semi-turnkey solutions.

Andes’ processor lineup is tailored for this era. The AX46MPV offers powerful compute and efficient control, compliant with RVA22+ including AIA and SV39/48/57 virtual memory. It features dual-issue for vector/scalar instructions, a Vector Processing Unit (VPU) with VLEN/DLEN from 128-1024 bits, supporting int4-int64 and bf16/fp16-64 formats, plus enhanced ReductionSum. Multicore support reaches 16 cores, with boosted memory via dual-issue load/store, strong outstanding capabilities, and a High-speed Vector Memory (HVM) interface handling multiple OOO requests. Performance gains over AX45MPV include ~18% in SpecInt2006 (5.65 score), over 2x in key vector libraries (libvec, libnn), and +40% bandwidth.

The AX66, a mid-range application processor, is RVA23 compliant with dual vector pipes (VLEN=128), 4-wide frontend decode, 128-entry ROB, 8 execution pipelines, and TAGE-L branch predictor. It supports up to 8 cores, 32MB shared L3 cache (mostly exclusive), and 128/256-bit AXI4 interfaces with IOMMU, APLIC, and CHI. Vector performance yields >10x in libnn key functions (9.6x average), >4x in libvec (3.55x average), and significant crypto boosts (4.7x SHA-256, 10.5x AES-128, 6.4x SM4). Bandwidth increases by 25%.

For high-end needs, the Cuzco series scales to 20 SpecInt2k6/GHz, with patented time-based scheduling via Time Resource Matrix for efficient instruction issuing and power reduction. RVA23 compliant, it features 8-wide decode, 256 ROB entries, 8 pipelines (2 per slice), advanced branch prediction, private L1/L2 caches, up to 256MB shared L3, multiprocessor up to 8 cores, and CHI/256-bit MMIO. Early 5nm implementation targets 2.5GHz, with current SpecInt2006 at ~18/GHz, using 7M gates for CPU and 4.5M for 2MB L2.

Andes enhances the ecosystem with AndesAIRE, an “AI Runs Everywhere” end-to-end solution, including IDEs, NN SDKs, compilers (MLIR, TVM), interpreters (ONNX Runtime, PyTorch), and accelerators like AndLA 1350. OS support is comprehensive: RISC-V specs (RVA22/23 profiles, SoC platforms), Linux distros (Debian, Fedora, Ubuntu, verified by Andes), upstream kernel features (strace/ftrace, Perf, HIGHMEM, CPU hotplug, ongoing Suspend-to-RAM and PowerBrake), bootloaders (U-Boot, OpenSBI), and RTOS (FreeRTOS, Zephyr, Thread-X).

Bottom line: Charlie noted Andes leads RISC-V IP shipments with rich portfolios. The latest processors—AX46MPV for compute/control, AX66 to Cuzco for performance—position Andes strongly. The RISC-V ecosystem is ready for Intelligent General Computing, promising transformative impacts across industries.

Contact Andes

Also Read:

Journey Back to 1981: David Patterson Recounts the Birth of RISC and Its Legacy in RISC-V

Google’s Road Trip to RISC-V at Warehouse Scale: Insights from Google’s Martin Dixon

Bridging Embedded and Cloud Worlds: AWS Solutions for RISC-V Development


Simulating Quantum Computers. Innovation in Verification

by Bernard Murphy on 12-29-2025 at 6:00 am


Quantum algorithms must be simulated on classical computers to validate correct behavior, but this looks very different from classical logic simulation. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and lecturer at Stanford, EE292A) and I continue our series on research ideas. As always, feedback welcome.

The Innovation

This month’s pick is How to Write a Simulator for Quantum Circuits from Scratch: A Tutorial. The authors are from École de Technologie Supérieure, Montreal and the University of Massachusetts. The paper was posted in June 2025 in arXiv.

Quantum simulators work on an abstraction – how qubits and “gates” are implemented is a fascinating topic but a distraction for this discussion. Our goal in this review is to introduce the topic of simulating quantum algorithms on a classical computer, because these methods are sufficiently disjoint from familiar classical computation to require an introduction before we move onto new research in this area.

This paper introduces a method to build a simulator for a small quantum computer (~20 qubits). It is supported by web-based implementations and code walkthroughs that give a sense of how quantum simulation works. Think of it in terms of linear algebra: evaluating a circuit multiplies an initial qubit state vector by a series of tensors corresponding to the gates in the circuit.
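As a concrete sketch in the spirit of the tutorial (my own minimal NumPy example, not the paper's code; the convention that qubit 0 is the most significant bit is an assumption of this sketch):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)     # Hadamard gate
I2 = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],                   # control = qubit 0 (MSB)
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

def apply_gate(state, gate, target, n):
    """Lift a 2x2 gate on qubit `target` (qubit 0 = most significant)
    to the full 2^n x 2^n operator via Kronecker products, then multiply."""
    op = np.array([[1.0]])
    for q in range(n):
        op = np.kron(op, gate if q == target else I2)
    return op @ state

state = np.zeros(4); state[0] = 1.0              # |00>
state = apply_gate(state, H, 0, n=2)             # superposition on qubit 0
bell = CNOT @ state                              # (|00> + |11>) / sqrt(2)
probs = np.abs(bell) ** 2                        # measurement probabilities
```

The final state vector is [1/sqrt(2), 0, 0, 1/sqrt(2)]: measuring yields 00 or 11 with probability 0.5 each, the hallmark of entanglement.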

Paul’s view

Quantum venture funding is already well over $3B, rising fast and getting a lot of attention in the media. So what about verifying quantum circuits? First stop here is a quantum circuit simulator. Kudos to Bernard for finding a wonderfully written paper on this topic. It describes notations used for describing quantum circuits, both graphically and in equation form. It also works through the basic math needed to understand how a quantum simulator works. It’s an algorithmic level paper, not a paper on quantum physics.

In a digital circuit, each “bit” of state (a register or a wire) can be read and written independently. Logic simulators need to process transitions on registers and wires in time order, via an event queue, but this processing is local and need only consider the gates directly connected to each signal.

In the quantum world “qu-bits” of state are “entangled” and need to be considered collectively as a single “state vector”. Simulating a quantum circuit proceeds like an analog circuit simulation where a vector of all the voltages or currents on each wire is formed, and simulation involves multiplying this vector with a matrix whose coefficients are determined by the circuit components and connectivity. For a circuit with n wires an analog simulator must multiply a 1 x n circuit state vector by an n x n simulation matrix derived from the circuit structure.

The cool thing about a quantum circuit is that a circuit with n qubits has a state vector with 2^n elements, one for each of the 2^n binary representations of n bits. A quantum circuit performs operations simultaneously on all 2^n elements of this state vector, which means it conceptually operates in parallel on all 2^n possible values of the n qubits.

To simulate a quantum circuit with non-quantum digital hardware means multiplying a quantum state vector of size 2^n by a simulation matrix of size 2^n x 2^n, which is O(4^n) multiplications. The paper works through some neat algorithmic tricks, based on fundamental properties of quantum state vectors and simulation matrices, that improve the runtime complexity to O(n·2^n). The elements of the state vector are floating point numbers, so the entire simulation maps very well to GPUs, e.g. this NVIDIA blog claims evaluating up to 36 qubits using eight A100s. Wow!
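
The O(n·2^n) idea is that a single-qubit gate touches the state vector in O(2^n) work, without ever materializing the 2^n x 2^n layer matrix. A minimal numpy sketch (the function name `apply_1q_gate` and the axis convention are illustrative, not from the paper):

```python
import numpy as np

def apply_1q_gate(state, gate, k, n):
    """Apply a 2x2 gate to qubit k of an n-qubit state vector
    without forming the full 2^n x 2^n matrix (O(2^n) work per gate)."""
    psi = state.reshape([2] * n)                    # axis k indexes qubit k
    psi = np.tensordot(gate, psi, axes=([1], [k]))  # contract gate with axis k
    psi = np.moveaxis(psi, 0, k)                    # restore original axis order
    return psi.reshape(-1)

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
n = 20                                   # a 2^20-element state fits easily in RAM
state = np.zeros(2**n)
state[0] = 1.0                           # |00...0>
state = apply_1q_gate(state, H, 0, n)    # Hadamard on qubit 0
# amplitude 1/sqrt(2) on |00...0> and |10...0>, zero elsewhere
```

Each gate costs O(2^n), so d layers of n gates each cost O(n·d·2^n), versus O(4^n) per layer for the brute-force matrix product.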

Each element in the state vector is a complex number, whose magnitude squared is the probability of the circuit being in that state. The sum of all the magnitudes squared across the whole state vector is 1, and you can think of the state vector as representing a point on the surface of a 2^n dimensional hypersphere whose radius is 1. The goal of a typical quantum circuit algorithm is to use quantum gates to move the state vector around this hypersphere until it points almost perfectly along the axis of the dimension that is the desired result of the algorithm. Logic gates in digital circuits perform Boolean operations on state bits to calculate their result. Quantum gates rotate state vectors in various ways around their hypersphere. Developing a quantum algorithm requires figuring out a combination of rotational operations that move the state vector towards the desired result. Let’s see what Bernard can find published on what it means to verify these kinds of algorithms.

Raúl’s view

This month’s paper is a very nice, detailed tutorial on how to build a quantum circuit simulator using classical computing techniques, even with minimal prior knowledge of quantum mechanics. A simulator is verification 101; the purpose of creating a simulator from scratch is not as an alternative to existing open-source and commercial packages, but to gain a deeper understanding of quantum computing and the core algorithms it requires. It introduces essential quantum concepts and notations such as Dirac notation, state vectors, Hilbert space, tensor products, and the Bloch sphere, and quantum gates such as Hadamard, SWAP, Toffoli (CCNOT), and Pauli X, Y and Z. Unlike physical quantum computers, which collapse the state to 0 or 1 when measured, simulators can directly compute the complete state, including the probability of a 1 and the phase (the Bloch sphere coordinates of each qubit). Measurement gates collapse the state and result in two new state vectors, corresponding to a measurement of 0 and 1.
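
The measurement behavior described here can be sketched by projecting the state vector onto the two outcomes and renormalizing each branch (a numpy sketch; the name `measure_qubit` and the axis convention are illustrative, not the paper’s JavaScript API):

```python
import numpy as np

def measure_qubit(state, k, n):
    """Simulate measuring qubit k: return (p0, state_if_0, p1, state_if_1),
    where each branch is the projected, renormalized state vector."""
    psi = state.reshape([2] * n)
    branch0 = np.zeros_like(psi)
    branch1 = np.zeros_like(psi)
    idx0 = [slice(None)] * n; idx0[k] = 0   # amplitudes where qubit k = 0
    idx1 = [slice(None)] * n; idx1[k] = 1   # amplitudes where qubit k = 1
    branch0[tuple(idx0)] = psi[tuple(idx0)]
    branch1[tuple(idx1)] = psi[tuple(idx1)]
    p0 = np.sum(np.abs(branch0) ** 2)       # probability of reading 0
    p1 = np.sum(np.abs(branch1) ** 2)       # probability of reading 1
    s0 = branch0.reshape(-1) / np.sqrt(p0) if p0 > 0 else branch0.reshape(-1)
    s1 = branch1.reshape(-1) / np.sqrt(p1) if p1 > 0 else branch1.reshape(-1)
    return p0, s0, p1, s1

# Bell state (|00> + |11>)/sqrt(2): measuring qubit 0 gives 0 or 1 with p = 1/2,
# and each branch collapses the other qubit to match
bell = np.array([1.0, 0, 0, 1.0]) / np.sqrt(2)
p0, s0, p1, s1 = measure_qubit(bell, 0, 2)
```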

The resulting simulator can handle up to ~20 qubits on a personal computer, in roughly 1000–2000 lines of JavaScript (the largest quantum computer that can be simulated on an HPC is about 50 qubits). An emphasis is placed on efficiency to handle the computational complexity of explicit matrix multiplication: qubit-wise multiplication applies gates without explicitly forming the large layer matrices, but is still O(n·d·2^n) for d layers of n gates each; and SWAP, the exchange of the states of two qubits, is simulated by directly manipulating the indices of the state vector’s amplitudes, also exponential in complexity. Further enhancements mentioned include adding robust error checking, implementing memory-saving in-place updates, and leveraging hardware acceleration via GPU programming.
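
The index-manipulation trick for SWAP is simple to sketch: no matrix multiplication is needed, only a relabeling of amplitudes (numpy sketch; `swap_qubits` is an illustrative name):

```python
import numpy as np

def swap_qubits(state, i, j, n):
    """SWAP qubits i and j of an n-qubit state by permuting state-vector
    indices: exchanging two axes of the reshaped vector relabels amplitudes."""
    psi = state.reshape([2] * n)
    return np.swapaxes(psi, i, j).reshape(-1)

# |01> -> |10> for a 2-qubit state (qubit 0 is the most significant bit)
psi = np.array([0.0, 1.0, 0.0, 0.0])     # all amplitude on index 1 = |01>
print(swap_qubits(psi, 0, 1, 2))         # all amplitude moves to index 2 = |10>
```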

I found the paper a great introduction to quantum computing. The online simulators help explain the basics, and the paper references commercial systems and more advanced research for readers interested in more detail.

Also Read:

Quantum Advantage is About the Algorithm, not the Computer

Quantum Computing Technologies and Challenges

Quantum Computing Algorithms and Applications


Kirin 9030 Hints at SMIC’s Possible Paths Toward >300 MTr/mm2 Without EUV

Kirin 9030 Hints at SMIC’s Possible Paths Toward >300 MTr/mm2 Without EUV
by Fred Chen on 12-28-2025 at 2:00 pm

Number of masks required for the M0 through M3 layers

Earlier this month, TechInsights did a teardown of the Kirin 9030 chip found in Huawei’s Mate 80 Pro Max [1]. Two clear statements were made on the findings: (1) the transistor density of SMIC’s “N+3” process was definitely below that of the earlier 5nm processes from Samsung and TSMC, and (2) metal pitch was aggressively scaled using DUV multi-patterning. Given that the density (formula defined in [2]) is less than 125 MTr/mm2 (Samsung 5LPE), corresponding to a track pitch of 36 nm and gate pitch of 54 nm [3], we can infer that it is the minimum metal pitch that was aggressively scaled, going beyond double patterning. In this article, we will go over the possible paths ahead for SMIC that could ultimately enable transistor densities >300 MTr/mm2, knowing that minimum metal pitch is now likely being patterned by some form of self-aligned quadruple patterning (SAQP).

Some Guiding Numbers for Pitch Scaling

The actual pitches for SMIC’s latest N+3 and previous N+2 processes were found by TechInsights but never revealed publicly. When those processes are discussed in this article, representative pitches will be used.


It will be assumed that getting to >300 MTr/mm2 will follow the path shown in Table 1. At N+3, the M0 layer was shrunk aggressively; this will be repeated for M2 at N+4.

Table 1. Possible pitch shrink path from N+2 to >300 MTr/mm2. See text for explanations.

A number of clarifications are needed to explain the numbers used in Table 1.

The transistor density is calculated from the gate pitch and the track pitch, which is taken to be M2 here. We know that for N+3, the track metal is not the minimum pitch metal. The formula is the same as used in [2], with 60% weight on 4-transistor NAND cells covering 3 gate pitches, and 40% weight on a 32-transistor flip-flop covering 19 gate pitches. This gives [0.6*4/3+0.4*32/19]/(gate pitch*cell height)=1.474/(gate pitch*cell height) as the transistor density formula.

At the “2nm” node, the transition to buried power rail is expected, which enables the cell height to go from 6 tracks to 5 tracks.

For older nodes, M1 pitch can be less than gate pitch, e.g., 2/3 of gate pitch, but 36 nm pitch with EUV has stochastic defect density concerns [4,5], so it has been expected that M1 pitch will be relaxed to the same as gate pitch.

A 44 nm gate pitch and a 22 nm track pitch, with buried power rails allowing 5-track cells, would be necessary to get over 300 MTr/mm2.
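
As a sanity check, the weighted density formula can be evaluated directly for these pitches (a quick sketch; the function name is illustrative):

```python
# Transistor density from gate pitch and cell height, using the weighted
# NAND/flip-flop formula in the text: 0.6*4/3 + 0.4*32/19 ≈ 1.474 transistors
# per (gate pitch x cell height); cell height = track pitch x number of tracks.
def density_mtr_per_mm2(gate_pitch_nm, track_pitch_nm, tracks):
    cell_height_nm = track_pitch_nm * tracks
    weight = 0.6 * 4 / 3 + 0.4 * 32 / 19
    # 1 mm^2 = 1e12 nm^2; divide by 1e6 to express in millions of transistors
    return weight * 1e12 / (gate_pitch_nm * cell_height_nm) / 1e6

# 44 nm gate pitch, 22 nm track pitch, 5-track cells (buried power rail)
print(density_mtr_per_mm2(44, 22, 5))   # ~305 MTr/mm^2, just over the 300 target
```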

Different SAQP Approaches Proposed

Achieving a minimum metal pitch as small as 30 nm or smaller is no trivial feat. Two methods have been proposed by Huawei and SiCarrier.

Double SALELE

Huawei’s patent CN117751427 discloses what is essentially the SALELE [6] approach applied twice. “SALELE” stands for “self-aligned litho-etch-litho-etch;” it is a more sophisticated version of the traditional litho-etch-litho-etch double patterning approach. Double SALELE means doing SALELE twice to get the quadruple patterning effect (Figure 1).

Figure 1. Double SALELE approach. Left: First litho-etch (blue), followed by spacer (gray), then etch block/cut (yellow). Center: Second litho-etch (green), followed by etch block/cut (purple). This completes the first SALELE. Right: Second SALELE completed.

In the SALELE approach, sidewall spacers are applied to a first set of lines, formed conventionally by “litho-etch.” Then these lines may be cut using etch blocks patterned by a second mask. A third mask is used to pattern the second set of lines, with alignment assisted by the sidewall spacers. Then this second set of lines is cut, using a fourth mask.

This approach consumes an excessive number of masks. Four masks are needed for four sets of lines, so that each line printed by a given mask is separated by sufficient distance (≥ minimum allowed pitch). Four additional masks are needed for the etch block/cut locations, corresponding to each of the four sets of lines. This gives a total of eight masks! Fortunately, this is not the only approach.

Double SADP

SiCarrier’s patent CN117080054 [7] discloses an SAQP-class approach that uses half the number of masks used for double SALELE. In a way, it is a kind of cascaded, double self-aligned double patterning (SADP) (Figure 2).

Figure 2. Double SADP approach. Left: First spacers (gray) are formed on sidewall of mandrel pattern (blue). Center left: Etch block/cut (black) is applied to the spacer pattern. This completes the first SADP. Center right: Second spacers (yellow) are formed on the sidewalls of the first spacer pattern, followed by a gap fill (green). Etch block/cut (red) is applied to the gap fill pattern. This completes the second SADP. Right: Wide features are formed with a separate (fourth) mask.

The first SADP leaves a set of first spacers which correspond to the first set of metal lines. The gaps left after the second follow-on SADP correspond to the second set of metal lines. Wide metal lines are completed at the end. Like in SALELE, the two sets of lines are cut separately. However, SADP enables twice the line density compared to a single litho-etch, and the cuts can also be made two lines at a time. Thus, the number of masks is halved from 8 to 4.

Diagonal FSAV Grid Becomes a Must

With metal pitches of 30 nm or less, metal linewidths become 15 nm or less. It is actually difficult to focus, even with High-NA EUV, down to a spot as small as this; the Rayleigh resolution limit would be 0.61 wavelength/NA = 0.61*13.5/0.55 = 15 nm. But looking ahead to the sub-2nm node, stochastics will become the overwhelming reason why even with High-NA EUV, directly printing a via is not feasible (Figure 3).
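
The Rayleigh estimate quoted here is a one-line calculation; for reference, here it is alongside the standard 0.33-NA EUV case:

```python
# Rayleigh resolution limit: R = 0.61 * wavelength / NA
def rayleigh_nm(wavelength_nm, na):
    return 0.61 * wavelength_nm / na

print(rayleigh_nm(13.5, 0.55))   # High-NA EUV: ~15 nm spot
print(rayleigh_nm(13.5, 0.33))   # standard 0.33-NA EUV: ~25 nm spot
```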

Figure 3. Absorbed photon density (1 nm pixel) for a 22 nm x 11 nm via on 44 nm x 22 nm pitch, with 6 mJ/cm2 absorbed EUV dose.

Lithographic difficulty has been a key driving reason for using diagonal via grids [8]. The minimum via pitch at advanced nodes cannot be as small as the minimum metal line pitch (Figure 4). Routing doesn’t require it anyway [8,9].

Figure 4. Left: The minimum via pitch cannot be as small as the minimum metal line pitch. Right: Diagonal via locations could be allowed.

It will become necessary to fill the intersection area between vertically adjacent metal layers, using the fully self-aligned via process [10]. A focused EUV spot will be wider than the metal linewidth at 3nm and below.

Based on the pitches in Table 1, we can predict the maximum number of masks used for patterning the V0, V1, and V2 layers. With ArF immersion, we allow 80 nm distance between vias [11]. Brute force via multipatterning will result in up to four masks being used (Figure 5). A more efficient approach that fits the diagonal via grid is to use LELE double patterning to print portions of diagonal lines that cover the targeted via locations; a third mask could trim those portions if necessary.

Figure 5. Via multipatterning options. Each color represents a different mask. Top left: double patterning is sufficient for N+2, and some via layers of N+3. Top right: triple patterning would become necessary for N+4 and N+5. Bottom left: for N+6, quadruple patterning would become a necessary allowance if still using brute force multipatterning. Bottom right: Diagonal LELE (plus trim mask if necessary) is most efficient for accommodating the diagonal via grid.

Counting Cuts

Besides vias, metal line cuts add significantly to the mask count. For the M0 and M2 layers, the double SADP approach only requires two cut masks, while the double SALELE approach depends on the node pitches. The distances between cuts follow the same rules as for the vias. It could go up to four masks for the 1.x nm node (Figure 6).

Figure 6. Cut mask count for double SALELE metal layers. Each color represents a different mask.

The M1 and M3 layers are likely patterned by SALELE, so that narrow straight line cuts may be used to cut alternate lines, skipping lines in between. This would mean up to four masks (Figure 7).

Figure 7. Cut mask count for the SALELE metal layers (M1 and M3). Each color represents a different mask.

For EUV, SALELE cuts would still require two masks. Thus, DUV quadruple patterning for this purpose is still cheaper than EUV double patterning [12].

Smooth Ride Forward?

When the mask count increases for the M0 through M3 layers are tallied up for the different possible approaches, we get the overall result in Figure 8.

Figure 8. Number of masks required for the M0 through M3 layers for the representative nodes N+2 through N+6, for the different possible multipatterning combinations. “2xSALELE” = double SALELE, “2xSADP” = double SADP, “DFSAV” = diagonal line LELE on FSAV, with trim mask. SALELE is assumed applied to the M1 and M3 layers.

The double SALELE approaches will consistently require more masks than the double SADP approaches. The use of diagonal line double patterning with a trim mask on FSAV saves three masks for N+6 (44 nm pitch M1, 22 nm pitch M0 and M2). In the best case, only 7 masks are added incrementally from N+2 to N+4, and the total remains unchanged through N+6. Compare this with the worst case, where the mask count keeps increasing from N+2 past N+5, reaching 18 masks for N+6.

N+5 is seen to be a convenient shrink of N+4, with no added masks.

Thus, the multipatterning path must be carefully planned several nodes in advance to ensure that the mask count increase remains manageable.

References

[1] R. Krishnamurthy, “SMIC Steps Toward 5nm: Kirin 9030 Analysis Shows the Foundry’s N+3 Progress,” TechInsights.

[2] Skyjuice, “The Truth of TSMC 5nm,” Angstronomics.

[3] D. Schor, “Samsung 5 nm and 4 nm Update,” Wikichip Fuse.

[4] Y. Li, Q. Wu, Y. Zhao, “A Simulation Study for Typical Design Rule Patterns and Stochastic Printing Failures in a 5 nm Logic Process with EUV Lithography,” CSTIC 2020.

[5] Y-P. Tsai, C-M. Chang, Y-H. Chang, A. Oak, D. Trivkovic, R-H. Kim, “Study of EUV stochastic defect on wafer yield,” Proc. SPIE 12954, 1295404 (2024).

[6] Y. Drissi, W. Gillijns, J. U. Lee, R. R-H. Kim, A. Hamed-Fatehy, R. Kotb, R. N. Sejpal, F. Germain, J. Word, “SALELE Process from Theory to Fabrication,” Proc. SPIE 10962, 109620V (2019).

[7] F. Chen, “SiCarrier’s SAQP-Class Patterning Technique: a Potential Domestic Solution for China’s 5nm and Beyond,” Multiple Patterns.

[8] S-W. Peng, C-M. Hsiao, C-H. Chang, J-T. Tzeng, US Patent Application 20230387002; Y-C. Xiao, W. M. Chan, K-H. Hsieh, US Patent 9530727.

[9] F. Chen, “Exploring Grid-Assisted Multipatterning Scenarios for 10A-14A Nodes,” Multiple Patterns.

[10] J-H. Franke, M. Gallagher, G. Murdoch, S. Halder, A. Juncker, W. Clark, “EPE analysis of sub-N10 BEoL flow with and without fully self-aligned via using Coventor SEMulator3D,” Proc. SPIE 10145, 1014529 (2017).

[11] M. Burkhardt, Y. Xu, H. Tsai, A. Tritchkov, J. Mellmann, “Ultimate 2D Resolution Printing with Negative Tone Development,” Proc. SPIE 9780, 97800E (2016).


Podcast EP324: How Dassault Systèmes is Creating the Next Generation of Semiconductor Design and Manufacturing with John Maculley

Podcast EP324: How Dassault Systèmes is Creating the Next Generation of Semiconductor Design and Manufacturing with John Maculley
by Daniel Nenni on 12-26-2025 at 10:00 am

Daniel is joined by John Maculley, Global High-Tech Industry Strategy Consultant at Dassault Systèmes. John has over 20 years of experience advancing innovation across the semiconductor and electronics sectors. Based in Silicon Valley, he works with leading foundries, OSATs, design houses, and research institutes worldwide to accelerate technology co-optimization and strengthen ecosystem resilience.

In this informative and forward-looking discussion, Dan and John explore the industry’s evolving focus on what kind of IP is curated and leveraged. John describes knowledge and know-how as the new strategic, differentiating IP for many companies, and explains why the ability to codify and curate this information across the enterprise is becoming so valuable. He also describes how IP management is shifting toward governance and intelligence, and how, with AI augmentation, IP engineers can now design with a focus on manufacturability.

John discusses many other benefits of the work Dassault Systèmes is doing to facilitate an AI-augmented future for the semiconductor industry, including methods to capture institutional knowledge and make it available to all members of the team. The impact on design productivity for advanced 3DIC systems is significant.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


Why TSMC is Known as the Trusted Foundry

Why TSMC is Known as the Trusted Foundry
by Daniel Nenni on 12-26-2025 at 6:00 am

TSMC Ivey Fab

Taiwan Semiconductor Manufacturing Company (TSMC) is widely regarded as the world’s most trusted semiconductor foundry, a reputation built over decades through technological leadership, business model discipline, operational excellence, and reliability. In an industry where trust is as critical as transistor density, TSMC has become the backbone of the global digital economy.

First and foremost, TSMC’s pure-play foundry model is the foundation of its trustworthiness. Unlike integrated device manufacturers (IDMs) such as Intel and Samsung, which design and manufacture their own chips, TSMC does not compete with its customers. It manufactures chips exclusively for third parties and has maintained a strict firewall between customer designs. This neutrality reassures customers, from Apple and NVIDIA to AMD, Qualcomm, and countless startups, that their intellectual property will not be used against them. Over time, this consistency has created deep confidence across a vast ecosystem, making TSMC the default manufacturing partner for the world’s most valuable chip designers.

Second, TSMC’s technological leadership reinforces that trust. The company has consistently been first, or decisively best, to mass-produce advanced process nodes such as 7nm, 5nm, and 3nm at high yields. In semiconductor manufacturing, reliability is not just about innovation, but about delivering that innovation at scale, on schedule, and with predictable silicon. TSMC’s ability to translate cutting-edge research into stable, high-volume production has made it indispensable for customers whose product cycles depend on certainty. When companies commit billions of dollars to a chip design, they need confidence that the foundry can deliver exactly as promised, and TSMC has repeatedly proven it can.

Third, manufacturing excellence and yield consistency distinguish TSMC from competitors. Advanced chips are extraordinarily complex, and small variations can destroy profitability or product viability. TSMC’s laser focus on process control, defect reduction, and continuous improvement results in industry-leading yields. High yields mean lower costs for customers, faster ramp-ups, and fewer surprises after tape-out. This operational discipline is a major reason customers trust TSMC with their most advanced and sensitive designs.

Fourth, TSMC has built a reputation for strong intellectual property protection and confidentiality. Semiconductor designs represent years of research and billions in investment. TSMC has demonstrated, across thousands of customers, that it can securely handle highly confidential data without leaks or misuse. This trust is reinforced by TSMC’s internal culture, strict access controls, and long-standing customer relationships. In an era of increasing cyber and industrial espionage, this reliability is invaluable.

Fifth, TSMC’s scale and ecosystem integration create trust through inevitability. The company has invested hundreds of billions of dollars in fabrication plants, equipment, and talent, creating manufacturing capabilities that few others can match. Its close collaboration with equipment suppliers (such as ASML and Applied Materials), EDA vendors (Synopsys, Cadence, Siemens EDA), and IP companies (Synopsys, Arm, Analog Bits), collectively known as the Grand Alliance, allows customers to design within a mature, silicon-proven, and well-supported ecosystem. This reduces risk and shortens time-to-market, further cementing TSMC as the safest choice.

Sixth, TSMC’s long-term strategic thinking strengthens customer confidence. The company invests aggressively ahead of demand, often years before returns are guaranteed. This willingness to absorb risk ensures that capacity is available when customers need it, even during industry upcycles or shortages. During recent global chip shortages, TSMC’s capacity planning and prioritization reinforced its image as a stable, responsible industry steward.

Finally, TSMC’s global credibility and governance matter. While geopolitical risks exist, TSMC has demonstrated transparency, regulatory compliance, and cooperation with governments and customers worldwide. Its expansion into the United States, Japan, and Europe reflects a commitment to supply chain resilience and global trust.

Bottom line: TSMC is the trusted foundry not because of a single advantage, but because of a rare combination: neutrality, technological supremacy, manufacturing reliability, IP protection, scale, and long-term vision. In an industry where failure is catastrophic and trust is earned slowly, TSMC has become the gold standard and the cornerstone of modern semiconductor manufacturing.

Also Read:

TSMC’s Customized Technical Documentation Platform Enhances Customer Experience

A Brief History of TSMC Through 2025

Cerebras AI Inference Wins Demo of the Year Award at TSMC North America Technology Symposium