
Podcast EP315: The Journey to Multi-Die and Chiplet Design with Robert Kruger of Synopsys
by Daniel Nenni on 10-31-2025 at 10:00 am

Daniel is joined by Robert Kruger, product management director at Synopsys, where he oversees IP solutions for multi-die designs, including 2D, 3D, and 3.5D topologies. Throughout his career, Robert has held key roles in product marketing, business development, and roadmap planning at leading companies such as Intel, Broadcom, Nokia, and Altera. He brings extensive expertise in semiconductor technologies, including ASICs and FPGA products, as well as deep knowledge of specialized requirements across various sectors, including wireless infrastructure, military, automotive, industrial, and data center markets.

Dan explores emerging multi-die/chiplet design with Robert, who covers a wide range of topics. Robert discusses the market drivers for chiplets, from both a captive and an open market perspective. He explains how markets will mature based on both business and technical needs. Robert describes what the journey to a chiplet-based multi-die design approach looks like.

He discusses the value of IP subsystems and how they are built. Robert also covers the design and verification challenges associated with this new design approach and concludes with a summary of the capabilities Synopsys brings to the market to enable multi-die and chiplet-based solutions.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


Intel to Compete with Broadcom and Marvell in the Lucrative ASIC Business
by Daniel Nenni on 10-31-2025 at 6:00 am


The second chapter of our book “Fabless: The Transformation of the Semiconductor Industry” describes the ASIC business and how important it is. That was more than 10 years ago, and the ASIC business is still at the forefront of the semiconductor industry and a key enabler of the AI revolution we are experiencing today.

First let’s talk about the ASIC business, then we’ll look at what Lip-Bu Tan said in his prepared statement on the most recent investor call about the new Intel Central Engineering Group and the ASIC and design services business.

Application-Specific Integrated Circuits are custom-designed semiconductors engineered for specific applications, delivering unmatched performance, power efficiency, and cost-effectiveness compared to general-purpose chips like CPUs or GPUs. ASICs are pivotal in high-volume markets such as artificial intelligence, high-performance computing, telecommunications, automotive, and consumer electronics. Unlike programmable devices, ASICs are optimized for specific tasks, enabling innovations like AI accelerators, 5G/6G infrastructure, and edge computing. The ASIC ecosystem thrives on collaboration between fabless design houses, foundries, and end-users, driving technological advancements.

As of 2025, the global ASIC market is valued at more than $20 billion and is expected to double in the next five years. Key growth drivers include surging demand for AI, edge computing, and advanced connectivity (5G/6G). Systems companies are now doing their own ASICs, often using an ASIC company to jump-start internal chip design efforts. Apple’s first iPhone SoC used an ASIC service, as did Google’s first TPU.

There are two basic types of ASIC companies: semiconductor companies like Broadcom and Marvell that also do ASICs, and ASIC-specific services companies like Alchip and AION Silicon that only do ASICs and thus do not compete with their customers. Here are brief descriptions of the ASIC companies I know personally. Alchip and AION Silicon are current SemiWiki partners.

Broadcom Inc.

Broadcom is a semiconductor giant with a robust ASIC portfolio. It designs custom silicon for networking, storage, broadband, and AI, serving hyperscale data centers with tailored chipsets, such as accelerators for clients like Google. Broadcom’s ASIC business blends merchant silicon with bespoke designs, contributing significantly to its revenue. In Q2 2025, Broadcom reported record revenues, with custom ASIC segments achieving gross margins exceeding 50%. Its strength lies in AI leadership and diversified offerings.

Marvell Technology Inc.

Marvell specializes in data infrastructure semiconductors, with a growing focus on custom ASICs for AI, 5G, and cloud computing. Transitioning from storage controllers, Marvell now prioritizes high-speed, low-power SoCs and interconnects. Its Q2 FY2026 revenue surged 58%, driven by ASIC demand from AI and networking sectors. Partnering with leading foundries, Marvell is well-positioned for AI-driven growth, emphasizing scalable, high-performance silicon solutions.

Alchip Technologies

Alchip, established in 2003 in Taipei, Taiwan, is a fabless ASIC leader specializing in HPC and AI. Renowned for rapid prototyping and first-silicon success, Alchip collaborates with tier-one cloud providers and TSMC, leveraging advanced nodes for machine learning accelerators and automotive chips. As a TSMC Value Chain Alliance member, Alchip offers end-to-end services, from SoC design to manufacturing, ensuring high-performance, low-latency solutions. Its 2025 focus on sustainable silicon design strengthens its competitive edge in AI and networking markets.

AION Silicon

AION Silicon, a lesser-known but emerging player, focuses on innovative ASIC solutions for AI and IoT applications. Based in the U.S., AION emphasizes customizable, high-efficiency chips for edge computing and smart devices. While smaller than Broadcom or Marvell, AION’s agile approach and partnerships with foundries position it for growth in niche markets. Its 2025 roadmap highlights low-power AI accelerators, targeting cost-sensitive applications.

Arm, Qualcomm, MediaTek and other chip companies have also joined the custom silicon business but that is another story. Now let’s talk about what Lip-Bu announced:

“By connecting our architectures through Nvidia NVLink, we combine Intel CPU and x86 leadership with Nvidia’s unmatched AI and accelerated computing strengths, unlocking innovative solutions that will deliver better customer experience and provide a beachhead for Intel in the leading AI platform of tomorrow. We need to continue to build on this momentum and capitalize on our position by improving our engineering and design execution. This includes hiring and promoting top architecture talents, as well as reimagining our core roadmap to ensure it has best-in-class features. To accelerate this effort, we recently created the Central Engineering Group, which will unify our horizontal engineering functions to drive leverage across foundational IP development, test chip design, EDA tools, and design platforms. This new structure will eliminate duplications, improve time to decision-making, and enhance coherence across all product development.”

“In addition, and just as important, the group will spearhead the build-out of a new ASIC and design service business to deliver purpose-built silicon for a broad range of external customers. This will not only extend the reach of our core x86 IP, but also leverage our design strengths to deliver an array of solutions from general purpose to fixed-function computing.”

Bottom line: Brilliant move by Lip-Bu Tan! The ASIC business is critical but it is also VERY competitive. Rather than trying to compete with TSMC’s Value Chain Alliance or acquiring a large ASIC group (which is what Broadcom and Marvell did), Intel doing custom ASICs centered on Intel/Nvidia IP using Intel Foundry manufacturing and packaging is absolutely the right move.

Fill those fabs!

Also Read:

Yes Intel Should Go Private

AI Revives Chipmaking as Tech’s Core Engine

Advancing Semiconductor Design: Intel’s Foveros 2.5D Packaging Technology


Quadric: Revolutionizing Edge AI
by Daniel Nenni on 10-30-2025 at 10:00 am


In the rapidly evolving landscape of AI, Quadric stands out as a pioneering force in edge computing. Founded in 2018 and headquartered in Burlingame, California, Quadric is a technology company focused on developing high-performance, energy-efficient processors for AI workloads at the edge, in devices like smartphones, IoT sensors, autonomous vehicles, and industrial robots. Its flagship product, the Chimera processor, integrates the flexibility of general-purpose computing with the efficiency of specialized AI hardware, addressing the growing demand for on-device AI processing. This blog explores Quadric’s mission, technology, and impact on the edge AI ecosystem, highlighting its role in shaping the future of intelligent devices.

Quadric’s core innovation lies in its Chimera GPNPU, a hybrid processor that bridges the gap between traditional CPUs/GPUs and dedicated AI accelerators like TPUs. Unlike conventional neural processing units optimized solely for deep learning inference, the Chimera GPNPU combines a programmable architecture with specialized AI capabilities. This allows it to handle diverse workloads, including machine learning inference, signal processing, and classical computing tasks, all within a single licensable processor IP core. By unifying these functions, Quadric eliminates the need for multiple specialized processors, reducing complexity, power consumption, and cost—critical factors for edge devices where space and energy are constrained. For example, in autonomous vehicles, the Chimera can process real-time sensor data (e.g., LiDAR, radar) while running control algorithms, enabling faster and more efficient decision-making.

The Chimera’s architecture is a hybrid between a conventional C++-programmed DSP (the world’s largest and most capable, with up to 32,768 bits of parallelism) and a hardware accelerator for convolutions and matrix math. It features a single instruction dispatch feeding a massively matrix-parallel execution pipeline with up to 1,024 processing elements, optimized for the matrix and vector operations common in neural networks. Unlike traditional GPUs that rely solely on sequential instruction pipelines, Quadric’s processor can switch between two modes of execution: traditional linear code flow, or a dedicated matrix/convolution mode that executes operations in a dataflow-driven manner, minimizing latency and maximizing throughput. This hybrid of DSP behavior plus accelerator behavior is what earned the architecture the Chimera brand name – a processor with the DNA of two very different architectures merged into one pipeline. This approach delivers up to 10x better performance-per-watt compared to competing solutions, according to Quadric’s benchmarks. Additionally, its software stack, including the Quadric SDK and Chimera graph compiler, allows developers to program in familiar frameworks like TensorFlow or PyTorch, ensuring compatibility with existing ML models while optimizing them for the Chimera’s unique architecture. Critically, the processor runs both C++ and Python user code, making the Chimera core far more flexible than competing hardwired accelerators.
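To make the two-mode idea concrete, here is a toy Python sketch of a single pipeline that interleaves linear control-flow steps with matrix-mode steps. It is purely illustrative and does not represent Quadric's ISA or SDK:

```python
# Toy model (not Quadric's architecture): one program stream whose steps
# are tagged as either "scalar" control code or "matrix" dataflow ops.
def run_program(steps, state):
    for kind, op in steps:
        if kind == "scalar":
            # Conventional linear code flow operating on the whole state
            state = op(state)
        elif kind == "matrix":
            # Matrix/convolution mode: the op is applied element-wise,
            # standing in for a dataflow-driven array operation
            state = [[op(x) for x in row] for row in state]
    return state

program = [
    ("matrix", lambda x: x * 2),    # e.g., a scaling/convolution step
    ("scalar", lambda s: s[::-1]),  # e.g., control logic reordering rows
]
result = run_program(program, [[1, 2], [3, 4]])  # → [[6, 8], [2, 4]]
```

The point of the sketch is that both kinds of work pass through one pipeline, mirroring the blended DSP-plus-accelerator execution described above.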

Quadric’s impact extends across industries. In healthcare, its processors enable wearable devices to perform real-time diagnostics, such as detecting irregular heart rhythms, without relying on cloud connectivity, thus enhancing privacy and responsiveness. In industrial IoT, Quadric-powered sensors can analyze vibration or temperature data on-site, reducing latency and bandwidth costs. The automotive sector benefits from its ability to handle complex perception tasks in self-driving cars, where low power and high reliability are paramount. By processing AI workloads locally, Quadric’s technology also addresses privacy concerns, as sensitive data no longer needs to be transmitted to centralized servers—a growing priority in a data-conscious world.

Bottom line: Quadric is poised to shape the future of edge AI as devices become smarter and more autonomous. Its emphasis on general-purpose AI processing aligns with the trend toward heterogeneous computing, where no single processor type dominates. By enabling efficient, on-device intelligence, Quadric not only enhances performance but also democratizes AI deployment across resource-constrained environments. As edge AI demand grows to an estimated $70 billion market by 2030, Quadric’s innovative approach positions it as a key player in making intelligent systems ubiquitous, from smart homes to autonomous factories.

Visit Quadric Inc. for more information.

Ready to bring intelligence to the edge? Connect with a Quadric SME to explore your project.

Also Read:

Legacy IP Providers Struggle to Solve the NPU Dilemma

Recent AI Advances Underline Need to Futureproof Automotive AI

2025 Outlook with Veerbhan Kheterpal of Quadric


Why IP Quality and Governance Are Essential in Modern Chip Design
by Admin on 10-30-2025 at 6:00 am


By Kamal Khan

In today’s semiconductor industry, success hinges not only on innovation but also on discipline in managing complexity. Every system-on-chip (SoC) is built from hundreds of reusable IP blocks—standard cells, memories, interfaces, and analog components. These IPs are the foundation of the design. But if the foundation is weak, even the most ambitious architecture can fail.

This is where IP Quality and Governance in IPLM (Intellectual Property Lifecycle Management) become critical. They are not “nice to have” features. They are guardrails that protect design teams from costly errors, late rework, and unpredictable tape-outs.

The Value of IP Governance

Governance ensures that every IP block follows a clear lifecycle—moving from Development to Certification, then to Published, and eventually to Obsolete. At each step, policies define what must be true before an IP can move forward.

This isn’t about bureaucracy—it’s about consistency and trust. In a world where design teams are distributed across continents, governance makes sure that everyone is working from the same playbook.
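As a rough illustration of how such lifecycle governance behaves, here is a minimal Python sketch. The class, state names are taken from the lifecycle above, but the method and policy details are invented for this example and do not reflect any actual IPLM API:

```python
# Minimal sketch of a governed IP lifecycle: each IP moves through fixed
# stages, and a policy gate must be satisfied before leaving Development.
LIFECYCLE = ["Development", "Certification", "Published", "Obsolete"]

class IPBlock:
    def __init__(self, name):
        self.name = name
        self.state = "Development"
        self.checks_passed = False  # set True once quality gates pass

    def advance(self):
        """Move to the next lifecycle stage only if policy allows it."""
        idx = LIFECYCLE.index(self.state)
        if idx == len(LIFECYCLE) - 1:
            raise ValueError(f"{self.name} is already Obsolete")
        if self.state == "Development" and not self.checks_passed:
            raise PermissionError("quality gates must pass before Certification")
        self.state = LIFECYCLE[idx + 1]
        return self.state
```

In this toy version, `advance()` on an unverified block raises an error, while a block with passing checks walks Development → Certification → Published, which is the consistency guarantee governance provides.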

Why Quality Rules Matter

IP Quality checks are the quality gates that enforce readiness. Instead of relying on subjective judgment, they apply automated rules and checklists:

  • Does this IP version pass all regression tests?
  • Is the CAD environment aligned with the correct process node?
  • Are mandatory properties and metadata captured?
  • Has it cleared security or licensing requirements?

The answer is binary: Pass or Fail. Only when an IP meets the defined criteria does it advance.

This objectivity reduces risk, improves confidence, and accelerates integration.
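The binary Pass/Fail gate described above can be sketched as a set of automated predicates over an IP's metadata. The rule names and metadata fields below are hypothetical, chosen to mirror the checklist, not any real IPLM schema:

```python
# Illustrative quality-gate evaluator: every rule is a predicate over the
# IP's metadata; the verdict is strictly Pass or Fail, never a judgment call.
def evaluate_gates(ip_meta, rules):
    failures = [name for name, rule in rules.items() if not rule(ip_meta)]
    return ("Pass" if not failures else "Fail", failures)

rules = {
    "regressions_clean": lambda m: m.get("regression_failures", 1) == 0,
    "node_matches_cad":  lambda m: m.get("process_node") == m.get("cad_node"),
    "metadata_complete": lambda m: all(k in m for k in ("owner", "version")),
    "license_cleared":   lambda m: m.get("license_ok", False),
}

ip_meta = {"regression_failures": 0, "process_node": "5nm", "cad_node": "5nm",
           "owner": "analog_team", "version": "2.1.0", "license_ok": True}
verdict, failed = evaluate_gates(ip_meta, rules)  # → ("Pass", [])
```

Because the failing rule names are returned alongside the verdict, the gate is not only objective but also traceable: the team sees exactly which criterion blocked advancement.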

Real-World Examples

1. Safeguarding Tape-Out

A global SoC team nearly missed a major smartphone launch because a third-party IP was integrated at the wrong maturity level. The bug was caught late and fixing it cost both schedule and reputation. With IPLM governance, only Final-certified IPs would have been allowed into the design at tape-out, preventing the slip.

2. Ensuring Process Compatibility

An analog IP provider released an updated block for a 5nm process. Governance rules automatically checked the metal stack property. The rule flagged a mismatch before integration, saving weeks of debug time.

3. Enabling Global Collaboration

A memory IP updated in the U.S. was inadvertently used by an Asia-based team before it was production-ready. With IPLM, policies enforce semantic versioning and access control, ensuring that immature IPs stay restricted until they’re validated.

The Bigger Picture

The cost of a re-spin can run into the tens of millions. The cost of late discovery can be even higher: lost market windows.

By embedding IP Quality and Governance into the design process, organizations gain:

  • Predictability: Designs progress with fewer surprises.
  • Traceability: Every decision, rule, and approval is logged.
  • Scalability: Teams across geographies work with the same trusted data.

Closing Thought

In an era where semiconductor complexity grows faster than design cycles, discipline is the new differentiator. IP Quality and Governance in IPLM aren’t just technical features—they are strategic enablers of faster, safer, and smarter innovation.

Watch the Free Demo

Talk to an Expert

Kamal Khan is Vice President, North America Automotive/Semiconductor at Perforce. He has over 20 years of domestic and international experience, specializing in PLM, data management, IP lifecycle management, IoT security, semiconductors, enterprise software, EDA, CAD, 3D printing, and cloud solutions.

This article was originally published on Perforce.com. For more information on how Perforce IPLM streamlines semiconductor development, visit https://www.perforce.com/products/helix-iplm

Also Read:

IPLM Today and Tomorrow from Perforce

Perforce and Siemens at #62DAC

Perforce Webinar: Can You Trust GenAI for Your Next Chip Design?


U.S. Electronics Production Growing
by Bill Jewell on 10-29-2025 at 2:00 pm


U.S. electronics production has been on an accelerating growth trend over the last ten months. Three-month average change versus a year ago (3/12 change) has increased from 0.4% in October 2024 to 6.2% in August 2025. Japan’s 3/12 change has been positive since November 2024, but has been decelerating for most of 2025, reaching 1.1% in August. The 27 countries of the European Union (EU 27) have mostly experienced negative 3/12 change, with July 2025 at -1.8%. UK 3/12 change turned positive in July and reached 1.8% in August.
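For readers who want to reproduce these figures, the 3/12 metric can be computed directly from a monthly production index. Here is a minimal Python sketch, assuming the standard definition of a three-month average compared against the same three-month average one year earlier:

```python
# 3/12 change: average of the latest three months divided by the average
# of the same three months a year earlier, expressed as a percentage.
def change_3_12(monthly_index):
    """monthly_index: chronological list of at least 15 monthly values."""
    recent = sum(monthly_index[-3:]) / 3
    year_ago = sum(monthly_index[-15:-12]) / 3
    return (recent / year_ago - 1) * 100  # percent

# A flat series yields 0%; a uniform 6.2% lift in the latest three months
# reproduces the August 2025 U.S. figure cited above.
flat = [100.0] * 15
lifted = [100.0] * 12 + [106.2] * 3
```

Because each reading averages three months, the metric smooths month-to-month noise, which is why it is the standard way to track production trends.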

The accelerating growth in U.S. electronics production is partially due to U.S. and foreign companies shifting manufacturing to the U.S. from other countries. The shift is primarily due to the Trump administration’s tariffs – both implemented and threatened. However, the manufacturing shift has not led to increased employment in electronics manufacturing. According to the U.S. Bureau of Labor Statistics, U.S. employment in electronics manufacturing has declined from 1.04 million jobs in January 2024 to 1.001 million jobs in August 2025. August is the latest available month due to the government shutdown. In the last ten years, employment has been in a narrow range of 1.001 million to 1.062 million jobs.

Electronics production in key Asian countries has been volatile but generally growing over the last two years. The shift of electronics manufacturing to the U.S. has not had any noticeable impact on China. China electronics production has shown steady growth with 3/12 change in the range of 10% to 13% since January 2024. China’s 3/12 change in September 2025 was 10.7%. South Korea’s 3/12 change has moderated to 7.4% in July 2025 from 17% in May. Malaysia and Vietnam had similar trends in the last six months, with 3/12 change moderating from the 8% to 10% range in March through May 2025 to the 5% to 6% range in June through August. India has been volatile, with 3/12 change peaking at 15% in April 2025 and falling to zero in July. India’s August 3/12 change was 1.9%.

As noted in our June 2025 newsletter, U.S. imports of smartphones have dropped significantly beginning in April 2025. The Trump administration stated in April 2025 that smartphones were currently exempt from tariffs but would be subject to tariffs in “a month or two.” Six months later, no smartphone tariffs have been announced. Imports remained low based on the latest data available through July. Due to the U.S. government shutdown, data for August is not available. However, China’s export data is available through September 2025. Apparently, China has a functioning government. In September, China’s exports of smartphones to the U.S. increased sharply to 2.26 million units from 1.02 million units in August, an increase of 121%. The average unit price (AUP) of these smartphones almost doubled from $702 in August to $1,387 in September. These increases coincide with Apple’s introduction of its iPhone 17 models in September 2025. Apple and other smartphone companies have apparently been limiting imports to the U.S. due to the tariff threat, but increased imports in September to support the release of new models.

The Trump administration’s tariff policy may have contributed to increased U.S. electronics production, but it has not led to new jobs in the industry. The current tariffs and potential tariffs continue to cause uncertainty in the U.S. and global electronics industry.

Bill Jewell
Semiconductor Intelligence, LLC
billjewell@sc-iq.com

Also Read:

Semiconductor Equipment Spending Healthy

Semiconductors Still Strong in 2025

U.S. Imports Shifting


Failure Prevention with Real-Time Health Monitoring: A proteanTecs Innovation
by Daniel Nenni on 10-29-2025 at 10:00 am


In the complex world of semiconductors, reliability, availability, and serviceability (RAS) have become paramount, especially as devices shrink to nanoscale geometries like 2nm. At the recent 2025 TSMC OIP Forum, Noam Brousard, VP of Solutions Engineering at proteanTecs, presented “Failure Prevention with Real-Time Health Monitoring (RTHM™),” highlighting how modern electronics face unprecedented challenges. From smaller architectures and high-performance workloads to hyper-competition and cost pressures, these factors contribute to functional failures, silent data corruption, and system-wide errors. As hardware must endure longer lifecycles, often 4-6 years, without refresh, the risk of failures escalates, particularly in large-scale AI systems where devices operate at lower voltages and under unpredictable demands.

Silent data corruption (SDC) emerges as an insidious threat. Unlike detectable errors, SDC stems from untraceable hardware failures that evade exception mechanisms and system logs. It propagates undetected, causing cascading issues that demand extensive root-cause analysis. In AI-driven environments, SDC can yield incorrect outputs, faulty decisions, and parameter corruption in models, with catastrophic implications for critical applications. Brousard cited real-world examples underscoring SDC’s rise. Meta reported miscalculated mathematical operations in defective CPUs leading to database losses, where a file decompression error produced zero instead of 156. Alibaba Cloud encountered checksum mismatches in storage apps due to intermittent processor faults. Google noted manufacturing defects exposed by rare instructions in low-level libraries, while other cases involved incorrect hashing and cache coherence issues. Studies from Google, Meta, Facebook, and Alibaba reveal that approximately one in a thousand machines in large fleets suffers from SDC, emphasizing its prevalence in production CPU populations.

Traditional approaches fall short. Built-in self-test (BIST) integrations are complex and expensive, running only at startup with slow responses and no precise location pinpointing. Hardware and software checks often react post-failure, lacking the granularity needed for proactive intervention.

proteanTecs’ RTHM is part of the company’s comprehensive lifecycle solutions, spanning power/performance optimization, reliability monitoring, functional safety, chip and system production, and advanced packaging. RTHM shifts the paradigm from error containment to failure avoidance by providing electronics visibility from within. It employs on-chip Agents for high-coverage, continuous monitoring of actual performance-limiting paths, both at test and in mission mode. These Agents sample high-speed clocks in real paths, adhering to power-performance-area (PPA) constraints, and are sensitive to workload stress, latent defects, operating conditions, DC IR drops, local Vdroops, hot spots, and aging.

A key feature is the Performance Index (PI), an event-based algorithm that aggregates timing margin measurements across thresholds, affected areas, clock/power domains, and prior events. Analyzed per logical unit, PI delivers an integrated score reflecting issue severity—how close a device is to failure. Visualized as a percentage (e.g., 79%), it enables operators to act before problems escalate.
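The intuition behind such a score can be sketched in a few lines of Python. The actual Performance Index algorithm is proprietary and aggregates many more signals; this toy version only illustrates the idea of mapping the worst observed timing margin onto a percentage, with the threshold value chosen arbitrarily:

```python
# Toy health score (not proteanTecs' algorithm): the worst per-path timing
# margin is normalized against an assumed critical threshold, so a lower
# margin means the device is closer to failure.
def performance_index(margins_ps, critical_ps=50.0):
    """margins_ps: measured timing margins in picoseconds (assumed units);
    critical_ps: margin at or below which the score saturates toward 0%."""
    worst = min(margins_ps)
    return max(0.0, min(100.0, worst / critical_ps * 100))

score = performance_index([120.0, 80.0, 39.5])  # worst path dominates
```

Even this crude version shows why a single number is operationally useful: a fleet operator can alarm on any device whose score drops below a policy threshold, before a functional failure ever appears.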

Without RTHM, failures manifest after escalation, complicating root causes and incurring costly downtime. With it, potential issues are identified and mitigated preemptively, yielding faster, accurate, cost-effective predictions. This proactive stance avoids functional failures, prevents SDC, and eliminates system-wide errors. RTHM offers accurate fault detection at the circuit level, reliability monitoring for intrinsic/extrinsic faults, and unmatched resiliency to halt error propagation.

Bottom line: As semiconductors push boundaries, RTHM represents a transformative tool. By embedding intelligence directly into chips, it empowers engineers to predict and avert failures, safeguarding operations in an era of scale and complexity.

For more, contact proteanTecs.

Also Read:

Podcast EP313: How proteanTecs Optimizes Production Test

Thermal Sensing Headache Finally Over for 2nm and Beyond

DAC News – proteanTecs Unlocks AI Hardware Growth with Runtime Monitoring


Podcast EP314: An Overview of Toshiba’s Strength in Power Electronics with Jake Canon
by Daniel Nenni on 10-29-2025 at 8:00 am

Daniel is joined by Jake Canon, senior business development engineer at Toshiba America Electronic Components. Jake is an enthusiastic contributor to the semiconductor industry and has been working closely with engineers to find new discrete power solutions for a wide variety of cutting-edge applications.

Dan explores the 150-year history of Toshiba with Jake, who focuses on Toshiba’s advanced work in power electronics. Jake describes the work Toshiba has done and is doing with low voltage power MOSFETs. He explains that Toshiba is on its 11th generation of these devices. He provides excellent detail on the innovations that have been achieved in both device and packaging and the broad range of applications supported such as automotive and AI. Jake also describes the worldwide manufacturing footprint of Toshiba.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


Inference Acceleration from the Ground Up
by Lauro Rizzatti on 10-29-2025 at 6:00 am


VSORA, a pioneering high-tech company, has engineered a novel architecture designed specifically to meet the stringent demands of AI inference—both in datacenters and at the edge. With near-theoretical performance in latency, throughput, and energy efficiency, VSORA’s architecture breaks away from legacy designs optimized for training workloads.

The team behind VSORA has deep roots in the IP business, having spent years designing, testing, and fine-tuning their architecture. Now in its fifth generation, the architecture has been rigorously validated and benchmarked over the past two years in preparation for silicon manufacturing.

Breaking the Memory Wall

The “memory wall” has challenged chip designers since the late 1980s. Traditional architectures attempt to mitigate the impact on performance induced by data movement between external memory and processing units by layering memory hierarchies, such as multi-layer caches, scratchpads, and tightly coupled memory, each offering tradeoffs between speed and capacity.

In AI acceleration, this bottleneck becomes even more pronounced. Generative AI models, especially those based on incremental transformers, must constantly reprocess massive amounts of intermediate state data. Conventional architectures struggle here. Every cache miss—or any operation requiring access outside in-memory compute—can severely degrade performance.

VSORA tackles this head-on by collapsing the traditional memory hierarchy into a single, unified memory stage: a massive SRAM array that behaves like a flat register file. From the perspective of the processing units, any register can be accessed anywhere, at any time, within a single clock. This eliminates costly data transfers and removes the bottlenecks that hamper other designs.
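The payoff of collapsing the hierarchy can be seen with the standard average-memory-access-time model. The cycle counts and miss rate below are assumed for illustration only and are not VSORA measurements:

```python
# Back-of-the-envelope memory-wall model: average access time equals the
# hit time plus the miss rate times the miss penalty. A flat single-cycle
# memory has no misses, so the penalty term vanishes entirely.
def avg_access_cycles(hit_cycles, miss_penalty, miss_rate):
    return hit_cycles + miss_rate * miss_penalty

# Assumed numbers: a cached design with a 2% miss rate and a 200-cycle
# penalty versus a flat SRAM reachable in one clock.
cached = avg_access_cycles(hit_cycles=1, miss_penalty=200, miss_rate=0.02)
flat_sram = avg_access_cycles(hit_cycles=1, miss_penalty=0, miss_rate=0.0)
```

Under these assumed numbers the cached design averages five cycles per access against one for the flat memory, which is the kind of gap that dominates incremental-transformer workloads where state is constantly reread.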

A New AI Processing Paradigm: 16 Million Registers per Core

At the core of the VSORA architecture is a high-throughput computational tile consisting of 16 processing cores. Each core integrates 64K multi-dimensional matrix multiply–accumulate (MAC) units, scalable from 2D to arbitrary N-dimensional tensor operations, alongside eight high-efficiency digital signal processing (DSP) cores. Numerical precision is dynamically configurable on a per-operation basis, ranging from 8-bit fixed-point to 32-bit floating-point formats. Both dense and sparse execution modes are supported, with runtime-selectable sparsity applied independently to weights or activations, enabling fine-grained control of computational efficiency and inference performance.

Each core incorporates an unprecedented 16 million registers, orders of magnitude more than the few hundred to few thousand typically found in conventional architectures. While such a massive register file would normally challenge traditional compiler designs, VSORA overcomes this with two architectural innovations:

  1. Native Tensor Processing: VSORA’s hardware natively supports vector, tensor, and matrix operations, removing the need to decompose them into scalar instructions. This eliminates the manual implementation of nested loops often required in GPU environments such as CUDA, thereby improving computational efficiency and reducing programming complexity.
  2. High-Level Abstraction: Developers program at a high level using familiar frameworks, such as PyTorch and ONNX for AI workloads, or Matlab-like functions for DSP, without the need to write low-level code or manage registers directly. This abstraction layer streamlines development, enhances productivity, and maximizes hardware utilization.
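The difference between scalar decomposition and native tensor execution can be sketched in Python, with NumPy standing in for a tensor-native instruction set. This is an analogy for the programming-model contrast, not VSORA code:

```python
import numpy as np

# The nested scalar loops a CUDA-style kernel might spell out by hand:
def matmul_scalar(a, b):
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# Versus a single native matrix operation, as in point 1 above:
tensor_native = np.array(a) @ np.array(b)
```

Both produce the same result; the point is that hardware-native tensor operations let the second form be the programming model, eliminating the hand-written loop nest and the scheduling decisions that come with it.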

Chiplet-Based Scalability

VSORA’s physical implementation leverages a chiplet architecture, with each chiplet comprising two VSORA computational tiles. By combining VSORA chiplets with high-bandwidth memory (HBM) chiplet stacks, the architecture enables efficient scaling for both cloud and edge inference scenarios.

  • Datacenter-Grade Inference. The flagship Jotunn8 configuration pairs eight VSORA chiplets with eight HBM3e chiplets, delivering an impressive 3,200 TFLOPS of compute performance in FP8 dense mode. This configuration is optimized for large-scale inference workloads in datacenters.
  • Edge AI Configurations. For edge deployments, where memory requirements are lower, VSORA offers:
    • Tyr2: Two VSORA chiplets + one HBM chiplet = 800 TFLOPS
    • Tyr4: Four VSORA chiplets + one HBM chiplet = 1,600 TFLOPS

These configurations empower efficient tailoring of compute and memory resources to suit the constraints of edge applications.
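The throughput figures above imply a simple linear scaling rule: roughly 400 TFLOPS (FP8 dense) per chiplet, a figure derived here from the quoted Jotunn8 numbers rather than stated by VSORA. A few lines of Python confirm the arithmetic:

```python
# Per-chiplet throughput implied by the Jotunn8 figure: 3,200 TFLOPS
# across 8 compute chiplets (FP8 dense mode).
TFLOPS_PER_CHIPLET = 3200 / 8  # 400.0

def config_tflops(n_chiplets):
    """Linear scaling model implied by the published configurations."""
    return n_chiplets * TFLOPS_PER_CHIPLET

jotunn8 = config_tflops(8)  # datacenter flagship
tyr4 = config_tflops(4)     # edge, four chiplets
tyr2 = config_tflops(2)     # edge, two chiplets
```

The model reproduces all three published configurations exactly, which is consistent with the claim that compute scales with chiplet count while the HBM allocation is sized separately per deployment.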

Power Efficiency as a Side Effect

The performance gains are evident, but equally remarkable are the advances in processing and power efficiency.

Extensive pre-silicon validation using leading large language models (LLMs) across multiple concurrent workloads demonstrated processing efficiencies exceeding 50%, an order of magnitude higher than state-of-the-art GPU-based designs.

In terms of energy efficiency, the Jotunn8 architecture consistently delivers twice the performance-per-watt of comparable solutions. In practical terms, its power draw is limited to approximately 500 watts, compared to more than one kilowatt for many competing accelerators.

Collectively, these innovations yield multiple times higher effective performance at less than half the power consumption, translating to an overall system-level advantage of 8–10× over conventional implementations.

CUDA-Free Compilation Simplifies Algorithmic Mapping and Accelerates Deployment

One of the often-overlooked advantages of the VSORA architecture lies in its streamlined and flexible software stack. From a compilation perspective, the flow is dramatically simplified compared to traditional GPU environments like CUDA.

The process begins with a minimal configuration file of just a few lines that defines the target hardware environment. This file enables the same codebase to execute across a wide range of hardware configurations, whether that means distributing workloads across multiple cores, chiplets, full chips, boards, or even across nodes in a local or remote cloud. The only variable is execution speed; the functional behavior remains unchanged. This makes on-premises and localized cloud deployments seamless and scalable.
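A minimal sketch of this idea, with all target names and fields hypothetical: the same workload runs unchanged regardless of which target description is selected, mirroring the claim that only execution speed varies.

```python
# Sketch (all names hypothetical): a few-line target description selects
# the deployment; the workload code itself never changes.
TARGETS = {
    "tyr2":    {"chiplets": 2, "hbm_stacks": 1},
    "tyr4":    {"chiplets": 4, "hbm_stacks": 1},
    "jotunn8": {"chiplets": 8, "hbm_stacks": 8},
}

def deploy(workload, target_name: str):
    cfg = TARGETS[target_name]  # in a real flow, cfg would drive work distribution
    # Functional behavior is target-independent by construction.
    return workload()
```

Running the same workload against "tyr2" and "jotunn8" yields identical results; only the (unmodeled) execution time would differ.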

A Familiar Flow, Without the Complexity

Unlike CUDA-based compilation processes, the VSORA flow is refreshingly simple, free of the layers of manual tuning and complexity. Traditional GPU environments often require multiple painstaking optimization steps that, when successful, can deliver strong performance but are fragile and time-consuming. VSORA avoids this through a more automated and hardware-agnostic compilation approach.

The flow begins by ingesting standard AI inputs, such as models defined in PyTorch. These are processed by VSORA’s proprietary graph compiler, which automatically performs essential transformations such as layer reordering or slicing for optimal execution. It extracts weights and model structure and then outputs an intermediate C++ representation.

This C++ code is then fed into an LLVM-based backend, which identifies the compute-intensive portions of the code and maps them to the VSORA architecture. At this stage, the system becomes hardware-aware, assigning compute operations to the appropriate configuration—whether a single VSORA tile, a Tyr4 edge device, a full Jotunn8 datacenter accelerator, a server, a rack, or even multiple racks in different locations.
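The stages described above can be sketched as a simple pipeline (the function names and bodies are illustrative stand-ins, not VSORA's actual tools):

```python
# Sketch of the described three-stage flow: graph compiler -> intermediate
# C++ representation -> hardware-aware LLVM backend.
def graph_compile(model):
    # reorder/slice layers, extract weights and model structure
    return {"ir": f"graph({model})"}

def emit_cpp(ir):
    # intermediate C++ representation of the transformed graph
    return f"cpp[{ir['ir']}]"

def llvm_backend(cpp, target):
    # hardware-aware mapping of compute-intensive regions to the target
    return f"bin[{cpp}@{target}]"

def compile_pipeline(model, target):
    return llvm_backend(emit_cpp(graph_compile(model)), target)
```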

Invisible Acceleration for Developers

From a developer’s point of view, the VSORA accelerator is invisible. Code is written as if it targets the main processor. During compilation, the flow identifies the code segments best suited for acceleration and transparently handles the transformation and mapping to VSORA hardware. This significantly lowers the barrier to adoption, requiring no low-level register manipulation or specialized programming knowledge.

VSORA’s instruction set is high-level and intuitive, carrying over rich capabilities from its origins in digital signal processing. The architecture supports AI-specific formats such as FP8 and FP16 as well as traditional DSP arithmetic, all handled automatically on a per-layer basis. Switching between modes is instantaneous and requires no manual intervention.

Pipeline-Independent Execution and Intelligent Data Retention

A key architectural advantage is pipeline independence—the ability to dynamically insert or remove pipeline stages based on workload needs. This gives the system a unique capacity to “look ahead and behind” within a data stream, identifying which information must be retained for reuse. As a result, data traffic is minimized, and memory access patterns are optimized for maximum performance and efficiency, reaching levels unachievable in conventional AI or DSP systems.

Built-In Functional Safety

To support mission-critical applications such as autonomous driving, VSORA integrates functional safety features at the architectural level. Cores can be configured to operate in lockstep mode or in redundant configurations, enabling compliance with strict safety and reliability requirements.

Conclusion

VSORA is not retrofitting old designs for modern inference needs; instead, it is building from the ground up. With a memory architecture that eliminates traditional bottlenecks, compute units tailored for tensor operations, and unmatched power efficiency, VSORA is setting a new standard for AI inference—whether in the cloud or at the edge.

Also Read:

The Rise, Fall, and Rebirth of In-Circuit Emulation (Part 1 of 2)

The Rise, Fall, and Rebirth of In-Circuit Emulation: Real-World Case Studies (Part 2 of 2)

Silicon Valley, à la Française


AI-Driven DRC Productivity Optimization: Revolutionizing Semiconductor Design

AI-Driven DRC Productivity Optimization: Revolutionizing Semiconductor Design
by Daniel Nenni on 10-28-2025 at 10:00 am


The semiconductor industry is undergoing a transformative shift with the integration of AI into DRC workflows, as showcased in the Siemens EDA presentation at the 2025 TSMC OIP. Titled “AI-Driven DRC Productivity Optimization,” this initiative, led by Siemens EDA’s David Abercrombie alongside AMD’s Stafford Yu and GuoQin Low, highlights a collaborative effort to enhance productivity and efficiency in chip design. The presentation outlines a comprehensive AI system that revolutionizes the entire EDA workflow, from knowledge sharing to automated fixing and debugging.

At the core of this innovation is the Siemens EDA AI System, which leverages a GenAI interface, knowledge base, and data lake to integrate AI tools across the portfolio. This system, deployable on customer hardware or cloud environments, supports a unified installation process and offers flexibility to incorporate customer data and models. Tools like the AI Docs Assistant and Calibre RVE Check Assist boost user understanding by providing instant answers and leveraging TSMC design rule data, respectively. The AI Docs Assistant, accessible via browser or integrated GUIs, uses retrieval-augmented generation to deliver relevant citations, while Calibre RVE Check Assist enhances debugging with specialized images and descriptions from TSMC.

Collaboration is a key pillar, with features like Calibre RVE Check Assist User Notes enabling in-house knowledge sharing. Designers can capture fixing suggestions and images, creating a shared knowledge base that enhances DRC-fixing flows across organizations. Meanwhile, Calibre DesignEnhancer automates the resolution of DRC violations on post-routed designs, using analysis-based modifications to insert sign-off DRC-clean interconnects and vias. This tool’s ability to handle complex rules and dependencies makes it a standalone DRC fixing solution.

Calibre Vision AI addresses the unique challenges of full-chip integration by offering AI-guided DRC analysis. It provides lightning-fast navigation through billions of errors, intelligent debug clustering, and cross-user collaboration tools like bookmarks and HTML reports. AMD’s testimonial underscores a 2X productivity boost in systematic error debugging, with Vision AI reducing OASIS database sizes and load times significantly. Signals analysis, such as identifying fill overlaps with clock cells or CM0 issues in breaker cells, accelerates root-cause identification.

This AI-driven approach, bolstered by AMD and TSMC collaborations, optimizes DRC sign-off productivity by streamlining workflows and improving understanding, fixing, debugging, and collaboration. As the industry moves toward more complex designs, Siemens EDA’s AI system sets a new standard, promising faster cycle times and enhanced design robustness, paving the way for future innovations in semiconductor technology.

For more information contact Siemens EDA

Great presentation, absolutely.

Also Read:

Visualizing hidden parasitic effects in advanced IC design 

Protect against ESD by ensuring latch-up guard rings

Something New in Analog Test Automation


Emulator-Like Simulation Acceleration on GPUs. Innovation in Verification

Emulator-Like Simulation Acceleration on GPUs. Innovation in Verification
by Bernard Murphy on 10-28-2025 at 6:00 am


GPUs have been proposed before to accelerate logic simulation but haven’t quite met the need yet. This is a new attempt based on emulating emulator flows. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and lecturer at Stanford, EE292A) and I continue our series on research ideas. As always, feedback welcome.

The Innovation

This month’s pick is GEM: GPU-Accelerated Emulator-Inspired RTL Simulation. The authors are from Peking University, China, and NVIDIA. The paper was presented at DAC 2025 and has no citations so far.

There have been previous attempts to accelerate logic simulation using GPU hardware, which have apparently foundered on a poor match between the heterogeneous nature of logic circuit activity and the SIMT architecture of GPUs. This paper proposes a new approach, modeled on FPGA-based emulators/prototypers and supported by a very long instruction word architecture. It claims impressive speedup over CPU-based simulation.


Paul’s view

Very interesting paper this month from NVIDIA Research and Peking University. It takes a fresh look at accelerating logic simulation on GPUs, something Cadence has invested heavily in since acquiring Rocketick in 2016. With the explosion in GPU computing for AI, customer motivation to use GPUs to accelerate simulation is even higher, and we are doubling down on our efforts in this area.

An NVIDIA GPU is a massive single-instruction-multiple-thread (SIMT) machine. To harness its power requires mapping a circuit to a large number of threads that each execute the same underlying program with minimal inter-thread communication. The key to doing this is intelligent replication and intelligent partitioning of logic cones across threads. Replication reduces inter-thread communication: rather than computing shared fan-in of multiple logic cones in one thread and passing the result to other threads, just have threads for each logic cone replicate compute for that shared fan-in. Smart partitioning ensures that thread processors are well utilized: we don’t want thread processors executing very deep logic cones to leave other thread processors idle that executed short paths.

In this paper, the authors synthesize a circuit to an AND-Inverter graph. To mitigate the problem of a few deep logic cones bottlenecking parallelization, they introduce a “boomerang” partitioner. This partitioner aims to balance the fan-in width of each partition rather than the gate count of each partition. Each partition is then mapped to a bit-packed structure that can be batch loaded from memory and executed on an NVIDIA GPU very efficiently. This bit-packed structure uses a 32-bit integer AND instruction followed by a 32-bit XOR-with-mask instruction to perform 32 AND-INVERT operations in one shot, with all thread processors executing the same simple program.
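The word-level trick can be sketched in a few lines of Python: 32 gate evaluations are packed into one integer AND followed by one XOR with an inversion mask (variable names are illustrative, not from the paper):

```python
# Sketch: 32 AND-INVERT evaluations in two word-level operations, as described.
MASK32 = 0xFFFFFFFF

def and_invert32(a: int, b: int, invert_mask: int) -> int:
    # a, b: packed inputs for 32 AIG nodes, one bit lane per node;
    # invert_mask has a 1 bit wherever that node's output is inverted.
    return ((a & b) ^ invert_mask) & MASK32
```

Because every thread processor executes this same two-instruction program, the SIMT machine stays fully utilized regardless of which gates each lane represents.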

The authors benchmark their solution, GEM, on five open-source designs ranging in size from 670k to 5.5M gates. Comparing GEM on an NVIDIA A100 to a “commercial” RTL logic simulator running on a single core of an Intel Xeon 6136 (Skylake), GEM runs from 20x faster on the smallest design down to 2.5x faster on the largest. Impressive!

Raúl’s view

CPU-based RTL simulators are relatively slow and poorly scalable, while FPGA-based emulators are fast but expensive to set up and inflexible. The heterogeneous, irregular nature of digital circuits conflicts with GPUs’ SIMT (Single Instruction, Multiple Thread) architecture. GEM (GPU-accelerated emulator) overcomes these challenges, yielding the following results on commodity GPUs: an average 6x speed-up over 8-threaded Verilator and 9x over a leading commercial simulator. The system is open-sourced under Apache 2.0.

GEM’s main innovations are the “boomerang executor layer” which handles intra-block efficiency (how logic is executed inside a GPU thread block), and the partitioning flow which handles inter-block scalability (how a large circuit is divided into many pieces that can be simulated in parallel on thousands of GPU cores).

A common method for mapping logic circuits to a GPU is levelization: divide the circuit into “logic levels” so that gates at the same depth can be computed in parallel. But real circuits have many levels with only a few gates each. In GPU kernels running in SIMT fashion, each level would trigger a global synchronization, and most GPU threads would be idle most of the time. The result is poor GPU utilization and large synchronization overhead. Instead, in GEM each GPU thread block (representing one circuit partition) maintains a set of bits (8192 circuit states) in shared memory, and the boomerang layer executes logic across multiple levels (14 levels) in one pass. It processes these bits in a recursive, folded structure: pairs of bits A and B of the circuit state are repeatedly combined with an external constant C using bitwise logic operations:

r = (A AND B) XOR C

Conceptually, this is like “folding” the bit vector in half multiple times—each fold represents several logic levels being collapsed into one operation. It is performed in parallel across 32-bit words enabling word-level parallelism in addition to thread-level parallelism. These foldings are repeated 14 times until one resulting bit represents the output of a deep cone of logic. This “boomerang” execution pattern effectively computes the equivalent of 10-15 logic levels; because all operations happen within a thread block, synchronization is local, avoiding costly global GPU synchronizations. The boomerang shape corresponds to how logic density changes across circuit depth: many gates at shallow levels (wide part of the boomerang), few gates at deeper levels (narrow part).
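A toy Python version of the fold, using a small 8-bit state and one constant vector per level (the real design folds an 8,192-bit state in shared memory and applies the operation across 32-bit words):

```python
# Sketch of the boomerang fold: each pass combines the two halves of the
# state vector as r = (A AND B) XOR C, halving the width every level.
def boomerang_fold(state, constants):
    # state: list of 0/1 bits, power-of-two length;
    # constants: one list of 0/1 bits per fold level, half the current width.
    for c_level in constants:
        half = len(state) // 2
        state = [(a & b) ^ c
                 for a, b, c in zip(state[:half], state[half:], c_level)]
    return state
```

Each fold halves the state, so a handful of folds collapse a deep cone of logic into a single output bit, all within one thread block and without any global synchronization.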

The partitioning flow deals with two problems: 1) inter-block dependencies (if two circuit partitions depend on each other’s outputs within a simulation cycle, a global synchronization is needed at every time step, killing performance) and 2) replication cost. It builds on a known algorithm called RepCut (Replication-Aided Partitioning), which reduces these dependencies by duplicating some logic across partitions so each partition can simulate independently. RepCut was designed for tens of CPU threads, not hundreds or thousands of GPU thread blocks; used directly, the amount of duplicated logic grows to over 200% for just 200+ partitions. Instead of cutting the entire circuit into hundreds of partitions in one go, GEM performs multi-stage RepCut: splitting large designs in stages, minimizing replication, aligning partition size to GPU architecture constraints (boomerang width), and merging intelligently to ensure efficient GPU occupancy, at the cost of one additional synchronization point between stages. This reduces replication to under 3% for a 500K-gate circuit partitioned into 216 blocks.

GEM’s innovation lies in its emulator-inspired abstraction that maps circuit logic to GPU execution through a VLIW architecture and highly local memory access. The mapping flow borrows from traditional EDA synthesis, placement, and partitioning logic. It achieves high simulation density and GPU efficiency, outperforming multi-threaded Verilator (6x), commercial CPU-based tools (9x), and previous GPU approaches (8x) across design types ranging from RISC-V CPUs to AI accelerators. Keep in mind that this is two-state, bit-level simulation without the bells and whistles of a commercial simulator.

GEM combines EDA methods with GPU computing in a software-only, open-source package compatible with standard GPUs, which makes it appealing. While described as RTL level, it operates at the bit level like FPGA emulators. GEM currently lacks multi-GPU support, 4-state logic, arithmetic modeling, and event-driven pruning, requiring further development to potentially become a competitive simulation alternative.

Also Read:

Cadence’s Strategic Leap: Acquiring Hexagon’s Design & Engineering Business

Cocotb for Verification. Innovation in Verification

A Big Step Forward to Limit AI Power Demand