
Podcast EP276: How Alphawave Semi is Fueling the Next Generation of AI Systems with Letizia Giuliano

by Daniel Nenni on 02-28-2025 at 10:00 am

Dan is joined by Letizia Giuliano, Vice President of Product Marketing and Management at Alphawave Semi. She specializes in architecting cutting-edge solutions for high-speed connectivity and chiplet design architecture. Prior to her role at Alphawave Semi, Letizia held the position of Product Line Manager at Intel, where she facilitated the integration of complex IP for external customers, as well as within Intel’s graphics and CPU products. With a background in Electrical Engineering, Letizia has contributed significantly to her field through technical papers, presentations at conferences and her involvement in defining industry standards like OpenHBI and UCIe.

Dan explores the unique and demanding requirements for next generation systems with Letizia. The need for a platform approach that addresses high-performance connectivity requirements is discussed. The role of advanced interface support through IP, chiplets and custom silicon is examined with respect to the need to scale up and scale out new systems with higher quality, reliability and shorter time to market.

Letizia describes the broad offerings Alphawave Semi is bringing to market to address these challenges. The current and future impact of this technology is explored.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.

 


CEO Interview: Dr. Andreas Kuehlmann of Cycuity

by Daniel Nenni on 02-28-2025 at 8:00 am


Dr. Andreas Kuehlmann, Executive Chairman and CEO at Cycuity, has spent his career across the fields of semiconductor design, software development, and cybersecurity. Prior to joining Cycuity, he helped build a market-leading software security business as head of engineering at Coverity and, after its acquisition by Synopsys, as General Manager of the newly formed Software Integrity business unit. In that role he led its growth from double-digit-millions to a multi-hundred-million-dollar business. He also previously worked at IBM Research and Cadence Design Systems, where he made influential contributions to hardware verification. Dr. Kuehlmann served as an adjunct professor at UC Berkeley’s Department of Engineering and Computer Science for 14 years, and received a Ph.D. in Electrical Engineering from the Technical University Ilmenau, Germany.

Tell us about your company.

Cycuity provides software products and services to specify and verify semiconductor device security. We help customers to ensure that security weaknesses are identified and mitigated during the design phase before manufacturing. Our security solutions are a critical element in the semiconductor product ecosystem for commercial and defense industries. They provide the broadest security assurance for semiconductor development across the design supply chain from secure usage of third-party IPs (3PIP) to full chips, including firmware. Cycuity’s products fit smoothly into existing design flows and utilize the simulation and emulation products of all three EDA vendors: Synopsys, Cadence, and Siemens EDA. Furthermore, our technology is applied to perform advanced security assessments of legacy hardware components in existing systems.

What problems are you solving?

Security threats in modern hardware systems are complex, rapidly evolving, and often overlooked during the early stages of design. The Radix platform directly addresses these challenges by identifying security weaknesses and unexpected behaviors early in the chip design lifecycle, minimizing the risk of escapes that lead to potential exploits. Traditional verification tools frequently fall short in providing complete security coverage across hardware, firmware, and software. Radix closes this gap by delivering a comprehensive security verification solution that spans the entire system from block level to software.

Radix’s systematic approach allows teams to develop security measures effectively and document their proper functioning with full transparency. Radix transforms security assurance from a fragmented and reactive process into a proactive, scalable, and fully traceable solution.

What application areas are your strongest?

We excel in delivering quantifiable assurance for semiconductors used in critical applications across industries, especially for high-stakes applications in defense, automotive, and IoT where security and reliability are paramount.

What keeps your customers up at night?

Our customers are concerned about ensuring the security and resilience of their semiconductor chips and embedded systems. What keeps them up at night is the thought of receiving a call from one of their customers reporting a security vulnerability in a chip that is broadly deployed in many products. Besides the impact on their brand, the cost of remediating a hardware security flaw can be extremely high. Moreover, customers are concerned about delivering secure semiconductors which comply with increasingly stringent industry standards. We address these challenges head-on by providing quantifiable assurance and robust security practices. Our solutions empower customers to achieve confidence in their designs, so they can focus on innovation without compromising on security.

What does the competitive landscape look like and how do you differentiate?

Cycuity is uniquely positioned as a thought leader and innovator of hardware security solutions. We have demonstrated our commitment to the development of secure and resilient microelectronics for defense and commercial applications. Our Radix platform goes beyond the typical “pass or fail” checks. It offers unmatched security design support through advanced security exploration and analysis capabilities, as well as scalable and traceable security verification – helping to more effectively and efficiently achieve security signoff and prove compliance.

What new features/technology are you working on?

We’ve got some exciting new features coming soon; check back next month for the details. For now, we can share a bit about Radix’s unique exploration capabilities, which help security and verification teams better understand chip designs from a security perspective. Unlike functional security verification, which aims to ensure that a required set of security features is correctly implemented, security exploration investigates unknown or unintended side effects of security functions that are not entirely understood but could lead to security weaknesses or vulnerabilities. Security exploration with Radix offers powerful analysis and graphical visualization capabilities to reveal unexpected security behaviors that cannot be observed with traditional design tools. Even if the unexpected behavior turns out to be benign, fully analyzing and deeply understanding it serves as a powerful confirmation of the overall design intent.

How do customers normally engage with your company?

Many customers come to us needing to build a comprehensive chip security program, often starting from scratch. Security is not like flipping a switch or installing a software product. Rather, it is a journey, and we help customers progress along it: from training, security requirement development, and tool selection, through integration and production ramp, to documentation and security signoff for manufacturing.

Talk to a Security Expert

Also Read:

Cycuity at the 2024 Design Automation Conference

Hardware Security in Medical Devices has not been a Priority — But it Should Be


The Double-Edged Sword of AI Processors: Batch Sizes, Token Rates, and the Hardware Hurdles in Large Language Model Processing

by Lauro Rizzatti on 02-27-2025 at 10:00 am


Unlike traditional software programming, AI software modeling represents a transformative paradigm shift, reshaping methodologies, redefining execution processes, and driving significant advancements in AI processor requirements.

Software Programming versus AI Modeling: A Fundamental Paradigm Shift

Traditional Software Programming
Traditional software programming is built around crafting explicit instructions (code) to accomplish specific tasks. The programmer establishes the software’s behavior by defining a rigid set of rules, making this approach ideal for deterministic scenarios where predictability and reliability are paramount. As tasks become more complex, the codebase often grows in size and complexity.

When updates or changes are necessary, programmers must manually modify the code—adding, altering, or removing instructions as needed. This process provides precise control over the software but limits its ability to adapt dynamically to unforeseen circumstances without direct intervention from a programmer.

AI Software Modeling
AI software modeling represents a fundamental shift in how to approach problem solving. AI software modeling enables systems to learn patterns from data through iterative training. During training, AI analyzes vast datasets to identify behaviors, then applies this knowledge in the inference phase to perform tasks like translation, financial analysis, medical diagnosis, and industrial optimization.

Using probabilistic reasoning, AI makes predictions and decisions based on probabilities, allowing it to handle uncertainty and adapt. Continuous fine-tuning with new data enhances accuracy and adaptability, making AI a powerful tool for solving complex real-world challenges.

The complexity of AI systems lies not in the amount of written code but in the architecture and scale of the models themselves. Advanced AI models, such as large language models (LLMs), may contain hundreds of billions or even trillions of parameters. These parameters are processed using multidimensional matrix mathematics, with precision or quantization levels ranging from 4-bit integers to 64-bit floating-point calculations. While the core mathematical operations, namely, multiplications and additions (MAC), are rather simple, they are performed millions of times across large datasets with all parameters processed simultaneously during each clock cycle.
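The scale described above can be made concrete with a little arithmetic. The sketch below (illustrative only; the 175-billion-parameter count is an assumed example, not a figure from the article) shows how parameter count and quantization level translate directly into the memory needed just to hold a model's weights:

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

# A hypothetical 175-billion-parameter model at different precisions,
# from 64-bit floating point down to 4-bit integers.
params = 175e9
for bits in (64, 16, 8, 4):
    print(f"{bits:2d}-bit: {model_memory_gb(params, bits):8.1f} GB")
```

This is one reason quantization matters: dropping from 16-bit to 4-bit weights cuts the weight footprint (and the bandwidth needed to stream it) by 4x.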

Software Programming versus AI Modeling: Implications on Processing Hardware

Central Processing Unit (CPU)
For decades, the dominant architecture used to execute software programs has been the CPU, originally conceptualized by John von Neumann in 1945. The CPU processes software instructions sequentially—executing one line of code after another—limiting its speed to the efficiency of this serial execution. To improve performance, modern CPUs employ multicore and multi-threading architectures. By breaking down the instruction sequence into smaller blocks, these processors distribute tasks across multiple cores and threads, enabling parallel processing. However, even with these advancements, CPUs remain limited in their computational power, lacking the enormous parallelism required to process AI models.

The most advanced CPUs deliver computational power in the range of a few teraFLOPS and feature memory capacities reaching a few terabytes in high-end servers, with memory bandwidth peaking around 500 gigabytes per second.

AI Accelerators
Overcoming CPU limitations requires a massively parallel computational architecture capable of executing millions of basic MAC operations on vast amounts of data in a single clock cycle.

Today, Graphics Processing Units (GPUs) have become the backbone of AI workloads, thanks to their unparalleled ability to execute massively parallel computations. Unlike CPUs, which are optimized for general-purpose tasks, GPUs prioritize throughput, delivering performance in the range of petaFLOPS—often two orders of magnitude higher than even the most powerful CPUs.

However, this exceptional performance comes with trade-offs, particularly depending on the AI workload: training versus inference. GPUs can experience efficiency bottlenecks when handling large datasets, a limitation that significantly impacts inference but is less critical for training. LLMs like GPT-4, OpenAI’s o1/o3, Llama 3-405B, and DeepSeek-V3/R1 can dramatically reduce GPU efficiency. A GPU with a theoretical peak performance of one petaFLOP may deliver only 50 teraFLOPS when running GPT-4. While this inefficiency is manageable during training, where completion matters more than real-time performance, it becomes a pressing issue for inference, where latency and power efficiency are crucial.

Another major drawback of GPUs is their substantial power consumption, which raises sustainability concerns, especially for inference in large-scale deployments. The energy demands of AI data centers have become a growing challenge, prompting the industry to seek more efficient alternatives.

To overcome these inefficiencies, the industry is rapidly developing specialized AI accelerators, such as application-specific integrated circuits (ASICs). These purpose-built chips offer significant advantages in both computational efficiency and energy consumption, making them a promising alternative for the next generation of AI processing. As AI workloads continue to evolve, the shift toward custom hardware solutions is poised to reshape the landscape of artificial intelligence infrastructure. See Table I.

Attribute                | Software Programming             | AI Software Modeling
Application Objectives   | Deterministic and Targeted Tasks | Predictive AI and Generative AI
Flexibility/Adaptability | Rule-based and Rigid             | Data-driven Learning and Evolving
SW Development           | Specific Programming Languages   | Data Science, ML, SW Engineering
Processing Method        | Sequential Processing            | Non-linear, Heavily Parallel Processing
Processor Architecture   | CPUs                             | GPUs and Custom ASICs

Table I summarizes the main differences between traditional software programming vis-à-vis AI software modeling.

Source: VSORA

Key and Unique Attributes of AI Accelerators

The massively parallel architecture of AI processors possesses distinct attributes not found in traditional CPUs. Specifically, two key metrics are crucial for the accelerator’s ability to deliver the performance required to process AI workloads, such as LLMs: batch sizes and token throughput. Achieving target levels for these metrics presents engineering challenges.

Batch Sizes and the Impact on Accelerator Efficiency

Batch size refers to the number of independent inputs or queries processed concurrently by the accelerator.

Memory Bandwidth and Capacity Bottlenecks

In general, larger batches improve throughput by better utilizing parallel processing cores. As batch sizes increase, so do memory bandwidth and capacity requirements. Excessively large batches can lead to cache misses and increased memory access latency, thus hindering performance.
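The linear growth of memory requirements with batch size can be sketched with the key-value (KV) cache of a transformer, one of the largest per-request memory consumers during inference. The model configuration below (32 layers, 32 heads, head dimension 128, 4K context, 16-bit values) is an assumed Llama-7B-like example, not a figure from the article:

```python
def kv_cache_gb(batch, seq_len, n_layers, n_heads, head_dim, bytes_per_el=2):
    """Approximate KV-cache size in GB: keys and values (the factor of 2)
    for every layer, head, and token position, per request in the batch."""
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * bytes_per_el / 1e9

# Assumed 7B-class configuration at full 4K context
for batch in (1, 8, 64):
    gb = kv_cache_gb(batch, seq_len=4096, n_layers=32, n_heads=32, head_dim=128)
    print(f"batch {batch:3d}: ~{gb:6.1f} GB of KV cache")
```

Under these assumptions a single request already needs roughly 2 GB of cache, so a batch of 64 consumes well over 100 GB, illustrating why capacity, not just bandwidth, caps usable batch size.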

Latency Sensitivity

Large batch sizes affect latency because the processor must handle significantly larger datasets simultaneously, increasing computation time. Real-time applications, such as autonomous driving, demand minimal latency, often requiring a batch size of one to ensure immediate response. In safety-critical scenarios, even a slight delay can lead to catastrophic consequences. However, this presents a challenge for accelerators optimized for high throughput, as they are typically designed to process large batches efficiently rather than single-instance workloads.

Continuous Batching Challenges
Continuous batching is a technique in which new inputs are dynamically added to a batch as processing progresses, rather than waiting for a full batch to be assembled before execution. This approach reduces latency and improves throughput. It may slightly increase time-to-first-token, but provided the scheduler can keep up with execution, it achieves higher overall efficiency.
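The scheduling idea can be illustrated with a toy simulation (a sketch of the concept, not any real serving framework's scheduler): a slot frees as soon as its request finishes, and a waiting request joins the in-flight batch immediately instead of waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching(requests, max_batch, steps_needed):
    """Toy continuous-batching scheduler. Each time step runs one decode
    step for every in-flight request; freed slots are refilled at once."""
    waiting = deque(requests)
    in_flight = {}          # request id -> remaining decode steps
    completed = []
    t = 0
    while waiting or in_flight:
        # Admit new requests into free slots (no wait for a full batch).
        while waiting and len(in_flight) < max_batch:
            rid = waiting.popleft()
            in_flight[rid] = steps_needed[rid]
        # One decode step for every in-flight request.
        for rid in list(in_flight):
            in_flight[rid] -= 1
            if in_flight[rid] == 0:
                del in_flight[rid]
                completed.append((rid, t))
        t += 1
    return completed

done = continuous_batching(["a", "b", "c"], max_batch=2,
                           steps_needed={"a": 3, "b": 1, "c": 2})
print(done)  # "c" starts as soon as "b" finishes, without waiting for "a"
```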

Token Throughput and Its Computational Impact

Token throughput refers to the number of tokens—whether words, sub-words, pixels, or data points—processed per second. It depends on input token sizes and output token rates, requiring high computational efficiency and optimized data movement to prevent bottlenecks.

Token Throughput Requirements
Key to defining token throughput in LLMs is the time to first token, namely the latency before output begins, which continuous batching helps minimize. For traditional LLMs serving human readers, the output rate need only exceed human reading speed; for agentic AI that relies on direct machine-to-machine communication, sustaining much higher throughput is critical.

Traditional Transformers vs Incremental Transformers
Most LLMs, such as OpenAI-o1, LLAMA, Falcon, and Mistral, use transformers, which require each token to attend to all previous tokens. This leads to high computational and memory costs. Incremental Transformers offer an alternative by computing tokens sequentially rather than recomputing the full sequence at every step. This approach improves efficiency in streaming inference and real-time applications. However, it requires storing intermediate state data, increasing memory demands and data movement, which impacts throughput, latency, and power consumption.

Further Considerations
Token processing also presents several challenges. Irregular token patterns, such as varying sentence and frame lengths, can disrupt optimized hardware pipelines. Additionally, in autoregressive models, token dependencies can cause stalls in the processing pipeline, reducing the effective utilization of computational resources.

Overcoming Hurdles in Hardware Accelerators
In stark contrast to the CPU, which has undergone a remarkable evolutionary journey over the past 70 years, AI accelerators are still in their formative stage, with no established architecture capable of overcoming all the hurdles in meeting the computational demands of LLMs.

The most critical bottleneck is memory bandwidth, often referred to as the memory wall. Large batches require substantial memory capacity to store input data, intermediate states and activations, while demanding high data transfer bandwidth. Achieving high token throughput depends on fast data transfer between memory and processing units. When memory bandwidth is insufficient, latency increases, and throughput declines. These bottlenecks become a major constraint on computing efficiency, limiting the actual performance to a fraction of the theoretical maximum.
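The memory wall puts a hard ceiling on decode speed that is easy to estimate: if every weight must be streamed from memory once per generated token, bandwidth alone bounds tokens per second. The numbers below (a 70B-parameter model at 16-bit precision on roughly 3 TB/s of HBM) are assumed, illustrative figures:

```python
def max_tokens_per_sec(bandwidth_gbs, num_params, bytes_per_param=2):
    """Upper bound on single-stream decode rate when generation is
    memory-bandwidth-bound: every weight read once per output token."""
    model_bytes = num_params * bytes_per_param
    return bandwidth_gbs * 1e9 / model_bytes

# Hypothetical 70B model at 16-bit weights on ~3 TB/s of HBM
print(f"{max_tokens_per_sec(3000, 70e9):.1f} tokens/s ceiling")
```

Under these assumptions the ceiling is on the order of 20 tokens/s per stream regardless of available FLOPS, which is why batching (amortizing each weight read across many requests) and bandwidth improvements matter so much.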

Beyond memory constraints, computational bottlenecks pose another challenge. LLMs rely on highly parallelized matrix operations and attention mechanisms, both of which demand significant computational power. High token throughput further intensifies the need for fast processing performance to maintain smooth data flow.

Data access patterns in large batches introduce additional complexities. Irregular access patterns can lead to frequent cache misses and increased memory access latencies. To sustain high token throughput, efficient data prefetching and reuse strategies are essential to minimize memory overhead and maintain consistent performance.

Addressing these challenges requires innovative memory architectures, optimized dataflow strategies, and specialized hardware designs that balance memory and computational efficiency.

Overcoming the Memory Wall
Advancements in memory technologies, such as high-bandwidth memory (HBM)—particularly HBM3, which offers significantly higher bandwidth than traditional DRAM—help reduce memory access latency. Additionally, larger and more intelligent on-chip caches enhance data locality and minimize reliance on off-chip memory, mitigating one of the most critical bottlenecks in hardware accelerators.

One promising approach involves modeling the entire cache memory hierarchy with a register-like structure that stores data on a single clock cycle rather than requiring tens of clock cycles. This method optimizes memory allocation and deallocation for large batches while sustaining high token output rates, significantly improving overall efficiency.

Enhancing Computational Performance
Specialized hardware accelerators designed for LLM workloads, such as matrix multiplication units and attention engines, can dramatically boost performance. Efficient dataflow architectures that minimize unnecessary data movement and maximize hardware resource utilization further enhance computational efficiency. Mixed-precision computing, which employs lower-precision formats like FP8 where applicable, reduces both memory bandwidth requirements and computational overhead without sacrificing model accuracy. This technique enables faster and more efficient execution of large-scale models.

Optimizing Software Algorithms
Software optimization plays a crucial role in fully leveraging hardware capabilities. Highly optimized kernels tailored to LLM operations can unlock significant performance gains by exploiting hardware-specific features. Gradient checkpointing reduces memory usage by recomputing gradients on demand, while pipeline parallelism allows different model layers to be processed simultaneously, improving throughput.
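The memory effect of gradient checkpointing can be sketched with a toy accounting model (a simplification of the classic recompute-vs-store tradeoff; layer counts and unit costs are arbitrary illustration values): instead of keeping every layer's activations for the backward pass, only every k-th layer is stored, and one segment is recomputed at a time.

```python
import math

def activation_memory(n_layers, per_layer, checkpoint_every=None):
    """Toy model of activation memory held for the backward pass.
    Without checkpointing: every layer's activations are kept.
    With checkpointing: only every k-th layer's activations are stored,
    plus one k-layer segment recomputed on demand."""
    if checkpoint_every is None:
        return n_layers * per_layer
    stored = math.ceil(n_layers / checkpoint_every) * per_layer
    recompute_buffer = checkpoint_every * per_layer
    return stored + recompute_buffer

# 64 layers, 1 unit of activations each
print(activation_memory(64, 1))        # no checkpointing
print(activation_memory(64, 1, 8))     # checkpoint every 8 layers
```

In this toy model, checkpointing every sqrt(n) layers minimizes the total, trading extra forward recomputation for a large reduction in stored activations.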

By integrating these hardware and software optimizations, accelerators can more effectively handle the intensive computational and memory demands of large language models.

About Lauro Rizzatti

Lauro Rizzatti is a business advisor to VSORA, an innovative startup offering silicon IP solutions and silicon chips, and a noted verification consultant and industry expert on hardware emulation.

Also Read:

A Deep Dive into SoC Performance Analysis: Optimizing SoC Design Performance Via Hardware-Assisted Verification Platforms

A Deep Dive into SoC Performance Analysis: What, Why, and How

SystemReady Certified: Ensuring Effortless Out-of-the-Box Arm Processor Deployments


TRNG for Automotive achieves ISO 26262 and ISO/SAE 21434 compliance

by Don Dingee on 02-27-2025 at 6:00 am


The security of a device or system depends mainly on being unable to infer or guess an alphanumeric code needed to gain access to it or its data, be that a password or an encryption key. In automotive applications, the security requirement goes one step further – an attacker may not gain access per se, but if they can compromise vehicle safety in some way, they can cause significant problems for vehicles, property, and people. A cornerstone of security implementations is truly random numbers, and Synopsys has recently certified its True Random Number Generator (TRNG) for Automotive, achieving ISO 26262 compliance and ISO/SAE 21434 compliance.

Security and safety: increasing concerns for connected vehicles

Cars and trucks are starting to look less like embedded electronics systems and more like enterprise systems as cloud connectivity and CPU and AI processing take on more significant roles. Vehicles can now speak to the cloud, other vehicles, surrounding sensors, traffic signals, parking control, and other infrastructure.

Ensuring vehicle safety now includes preventing unauthorized remote access to its mission-critical systems via wireless communication. Security architectures rely on random numbers for:

  • Cryptographic keys: modern cryptographic algorithms depend on unpredictability, using secure, hardened keys derived from random numbers that resist high-computational-power cracking schemes.
  • Authentication: Devices must authenticate on a network before participating, using secure tokens and challenge/response codes to verify their identity.
  • Nonce generation and initial values: Many algorithms require a unique, random number as a starting point or a seed value to ensure a data block’s unique processing.
  • Entropy: a source of randomness underpins the development of secure and resilient communication protocols that can withstand sophisticated cyberattacks.

The National Institute of Standards and Technology (NIST) drives standardization for random number generation in the NIST SP 800 family of specifications. NIST SP 800-90A covers deterministic random bit generators, while NIST SP 800-90B defines entropy sources, and NIST SP 800-90C standardizes non-deterministic random bit generators, combining the deterministic and entropic approaches for truly random numbers.
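The deterministic/entropic split can be illustrated with a minimal sketch of an HMAC-based DRBG in the spirit of NIST SP 800-90A: a true entropy source (the role a TRNG plays) seeds a deterministic generator that then expands it into random bits. This is a teaching sketch with SHA-256 and no reseeding or health tests, not a certified or production implementation:

```python
import hmac, hashlib

class HmacDrbg:
    """Minimal HMAC-DRBG sketch in the spirit of NIST SP 800-90A
    (instantiate + generate only; no reseed or health tests).
    Not for production use."""
    def __init__(self, entropy: bytes):
        self.K = b"\x00" * 32
        self.V = b"\x01" * 32
        self._update(entropy)

    def _hmac(self, key, data):
        return hmac.new(key, data, hashlib.sha256).digest()

    def _update(self, data=b""):
        self.K = self._hmac(self.K, self.V + b"\x00" + data)
        self.V = self._hmac(self.K, self.V)
        if data:
            self.K = self._hmac(self.K, self.V + b"\x01" + data)
            self.V = self._hmac(self.K, self.V)

    def generate(self, n: int) -> bytes:
        out = b""
        while len(out) < n:
            self.V = self._hmac(self.K, self.V)
            out += self.V
        self._update()
        return out[:n]

# In a real system the seed comes from a hardware entropy source (TRNG)
drbg = HmacDrbg(b"seed-material-from-a-true-entropy-source")
print(drbg.generate(16).hex())
```

The key property the sketch shows is that the output is only as unpredictable as the seed: a weak or guessable entropy source compromises every key derived downstream, which is exactly why hardware TRNG quality is certified.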

Functional safety via ISO 26262 and automotive cybersecurity via ISO/SAE 21434 add another layer of more formal certification requirements for automotive systems. Both standards help evaluate and categorize risks of system degradation and their severity, pointing developers to areas requiring risk mitigation or elimination. Third-party compliance testing audits automotive electronics and software design processes and verifies implementations.

Extending proven Synopsys TRNG IP solutions to automotive

Synopsys has developed and fielded TRNG IP solutions for many years. The architecture combines signal conditioning with noise sources providing ongoing entropy while not depending on process-specific circuitry, helping make the IP solution easily portable across technologies.

The latest TRNG for Automotive solution provides high-quality random numbers while integrating into automotive systems focused on safety and cybersecurity. The automotive variant of the IP derives from the NIST SP 800-90C compliant TRNG Core. It incorporates additional safety mechanisms enhancing the ability to detect, recover, and report anomalies that can lead to system failures. These mechanisms include parity bus protection for interfaces, dual rail alarms monitoring two separate data paths, and parity protection on input buffers and safety registers.

Third-party compliance evaluation at SGS-TÜV has certified the TRNG for Automotive IP for ISO 26262 with ASIL D compliance for systematic faults and ASIL B compliance for random hardware faults. Compliance with SAE/ISO 21434 cybersecurity processes is also certified by SGS-TÜV for the Automotive TRNG solution.

This no-compromise approach from Synopsys allows automakers and automotive suppliers to create communication and processing schemes with secure, safe cryptographic features based on highly reliable TRNG. More details on TRNG solutions are available online from Synopsys.

Datasheet: Synopsys True Random Number Generator for NIST SP 800-90C

White Paper: Truly Random Number Generators for Truly Secure Systems


Is Arteris Poised to Enable Next Generation System Design?

by Mike Gianfagna on 02-26-2025 at 10:00 am


The semiconductor ecosystem is changing. Monolithic design is becoming multi-die design. Processors no longer inform software development options. It’s now the other way around with complex AI software informing the design of purpose-built hardware. And all that special-purpose hardware needs drivers to make it come to life. This interplay of complex, multi-chip connectivity and ever-increasing demands of how the software invokes the hardware are all new. This isn’t your father’s (or mother’s) chip design project. All of this made me wonder where the driving forces will be to take us to the next level of semiconductor system design. There are many important players in this field. Recently, I was struck by a series of observations about one of those players. The apparent alignment is noteworthy. In this post, I’ll explore those observations. Is Arteris poised to enable next generation system design?

Some Observations

Most folks think “network on a chip”, or NOC when they hear the name Arteris. The company has certainly blazed an important trail toward automating the interconnect of vast on-chip resources. That is the beginning of the story and not the end, however. Providing the backbone to connect the parts of a complex design opens many doors. Let’s look at a few.

One is the “memory wall” problem. While collections of CPUs and GPUs deliver huge amounts of performance, the memories that manage critical data for those systems lag in performance, and they lag by orders of magnitude. This is the memory wall problem.

A popular approach to dealing with this issue is to pre-fetch data and store it in a local cache. This way is far faster – a few CPU cycles vs. over 100 CPU cycles. It’s a great approach, but it can be tricky to ensure the right data is in the right place at the right time, and consistent across all caches. Systems that effectively deliver this solution are called cache coherent, and achieving this goal is quite difficult.
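The payoff of caching follows directly from the average memory access time (AMAT) formula. The cycle counts below are assumed figures chosen to echo the article's "a few cycles vs. over 100 cycles" contrast:

```python
def amat(hit_cycles, miss_cycles, hit_rate):
    """Average memory access time in cycles for a single cache level."""
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles

# Assumed: 4-cycle cache hit, 120-cycle trip to main memory
for hr in (0.90, 0.99):
    print(f"hit rate {hr:.0%}: {amat(4, 120, hr):5.2f} cycles on average")
```

Even at a 90% hit rate, misses dominate the average, which is why getting "the right data in the right place at the right time" (and keeping it coherent across caches) has such leverage on system performance.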

Arteris has developed a cache-coherent NOC to address this challenge. That’s one obstacle out of the way. You can learn more about this Arteris solution here.

Another challenge is simply keeping track of the huge list of IPs used in contemporary designs. Tasks here include ensuring the right IP is deployed, keeping all stakeholders up to date, and tracking the various configurations and derivatives. Current designs can contain from 500 to more than 1,000 IP blocks, with 200K to more than 5M registers. Challenges this creates include:

  • Content integration from various sources (soft IP, configurable third-party IP, internal legacy IP)
  • Design manipulation (derivative, repartitioning, restructuring)

There are many formats used to keep track of all this; spreadsheets, DOCX, IP-XACT, and SystemRDL are a few examples. Again, Arteris has a well-thought-out solution to this problem: its Magillem technology can tame it. You can learn more about this Arteris solution here.

And I’ll examine one more. How to keep track of all the data required to orchestrate the interface between all that complex software and the hardware that brings those software innovations to life. This is typically called the hardware/software interface, or HSI.

This problem has many dimensions. Not only is it complex, but the formats needed by all stakeholders are different. That is, folks like RTL architects, software developers, verification engineers and technical publication staff all need their own version of this information in a specific format. Generating all that information in lockstep and conveying the same design intent in different formats is not easy.
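The single-source idea behind such HSI flows can be sketched in a few lines: one register description is rendered into the different views each stakeholder needs. Everything here (register names, fields, the output formats) is invented for illustration and is not Arteris' or Semifore's actual format:

```python
# Hypothetical single-source register description, rendered into two of
# the views an HSI flow might need: a C header for firmware and a text
# table for documentation.
REGS = [
    {"name": "CTRL",   "offset": 0x00, "fields": [("EN", 0, 1), ("MODE", 1, 2)]},
    {"name": "STATUS", "offset": 0x04, "fields": [("READY", 0, 1)]},
]

def c_header(regs, base=0x4000_0000):
    """View for software developers: address #defines."""
    return "\n".join(f"#define {r['name']}_REG 0x{base + r['offset']:08X}"
                     for r in regs)

def doc_table(regs):
    """View for technical publications: a register/field table."""
    rows = ["Register | Offset | Fields (name:lsb:width)"]
    for r in regs:
        fields = ", ".join(f"{n}:{lsb}:{w}" for n, lsb, w in r["fields"])
        rows.append(f"{r['name']} | 0x{r['offset']:02X} | {fields}")
    return "\n".join(rows)

print(c_header(REGS))
print(doc_table(REGS))
```

Because both views derive from the same source, a change to an offset or field propagates to every stakeholder's format in lockstep, which is the property that makes the approach attractive at the scale of millions of registers.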

In a past life, I worked with a company called Semifore that had a very well-thought-out way of dealing with these challenges. What happened to Semifore? If you guessed they are now part of Arteris, you would be right. More technology to knock down more of the obstacles to next generation designs. The core tool from Semifore is called CSRCompiler, and the diagram below will give you a sense of what it can do.

CSR Compiler Unifies Design Teams

There is a lot more to be said here, but you get the idea.

What’s Next?

I’ve just scratched the surface, highlighting some high-profile challenges that must be tamed to get to the next level of semiconductor system design. It turns out Arteris has mainstream technology to address all of them. They are a NOC company, and a lot more.

There are other challenges to be met of course. IP-XACT is an important element of advanced system design. This standard, also known as IEEE 1685, is an XML format that defines and describes reusable IP to facilitate its use in chip designs. IP-XACT was created by the SPIRIT Consortium as a standard to enable automated configuration and integration through tools and evolved into an IEEE standard.

There is a new version of this standard called IEEE 1685-2022. This new version contains a lot of additional functionality. It will be important for any company who aims to enable next generation system design to support these new capabilities. A partial list of what’s new includes:

  • Removed conditionality
  • Added XML document to describe memory element definitions
  • Added mode-dependent memory and register access
  • Added mapping from ports to register fields
  • Added register field aliasing and broadcasting
  • Added power domains and power domain bindings

Even though there are new challenges on the horizon, I have an optimistic view of how Arteris can help. In an Electronic Design article, Insaf Meliane, Product Management & Marketing Manager at Arteris stated:

The ever-evolving landscape of semiconductor chip design necessitates effective communication between design teams. The HSI serves as the bridge, and while challenges persist due to differing languages and requirements, tools like the CSRCompiler help simplify the process.

The methodology automatically documents changes across entire functional teams to deliver a reliable, up-to-date specification. It provides a single-source specification for register and memory-map information, fully configured for all teams in the formats and views they require.

This gives me more confidence in the Arteris approach to these problems. Is Arteris poised to enable next generation system design? I think the answer is YES, and I can’t wait to see what’s next.

Also Read:

Arteris Raises Bar Again with AI-Based NoC Design

MCUs Are Now Embracing Mainstream NoCs

Arteris Empowering Advances in Inference Accelerators


Bug Hunting in Multi Core Processors. Innovation in Verification

by Bernard Murphy on 02-26-2025 at 6:00 am


What’s new in debugging multi-/many-core systems? Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and lecturer at Stanford, EE292A) and I continue our series on research ideas. As always, feedback welcome.

The Innovation

This month’s pick is Post-Silicon Validation of the IBM POWER9 Processor, published at the 2020 DATE conference. The authors are from IBM and the paper has one citation.

This topic continues to attract interest given the accelerating growth of these platforms among hyperscalers, though for some reason it has created barely a ripple among our usual research paper resources. An exception is the IBM Threadmill paper we covered in 2022 and a number of follow-on papers from the same group. Here we review the latest of these papers, describing IBM’s application of the approach to their POWER9 processor.

The same basic approach continues from the earlier paper: post-silicon testing using a bare-metal exerciser with automated randomization between cycles. Several important refinements have been added. One I find interesting is irritators, used to bias towards multi-thread (and possibly multi-core) conflicts.

 

Paul’s view

We’re zooming back in again this month on randomized instruction generation for processor verification. A few years ago we blogged on a tool called Threadmill used by IBM for verification of their POWER7 processor. This month we’re checking out a short paper on their experiences verifying the POWER9 processor.

More and more companies are developing custom processors based either on Arm-64 or RISC-V ISAs. Arm-based computing is scaling out in datacenters and laptops, and RISC-V processors are becoming widespread in a variety of embedded applications. Verifying processors, especially advanced ones with multiple cores and multiple out-of-order execution pipelines, is really hard and somewhat of a dark art.

Threadmill is a low-level “exerciser” program that runs directly on the bare-metal processor (i.e., without any OS). It is configured with templates: snippets of machine code parametrized so they can be randomized in some way (randomized instructions, randomized addresses, and so on). The exerciser can be run pre-silicon in simulation, emulation, or on FPGA, and can also be run post-silicon in the lab.
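A minimal sketch of the template idea follows. The template text, field names, and `expand` helper are hypothetical, invented for illustration; they are not Threadmill's actual format.

```python
import random

# Hypothetical Threadmill-style template: a snippet of machine code with
# holes (register, address, immediate) that the exerciser rebinds to
# fresh random values on every pass.
TEMPLATE = [
    "ld   r{dst}, 0x{addr:08x}",   # load from a randomized address
    "addi r{dst}, r{dst}, {imm}",  # add a randomized immediate
    "st   r{dst}, 0x{addr:08x}",   # store back to the same address
]

def expand(template, rng):
    """Bind each template hole to one random value per pass."""
    binding = {
        "dst": rng.randrange(1, 32),               # any GPR except r0
        "addr": rng.randrange(0, 1 << 20) & ~0x3,  # word-aligned address
        "imm": rng.randrange(-128, 128),
    }
    return [line.format(**binding) for line in template]

# Seeding every core's generator identically (a trick the original
# Threadmill work describes) makes threads collide on the same
# addresses, raising the frequency of load/store races.
for line in expand(TEMPLATE, random.Random(42)):
    print(line)
```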

This paper shares several interesting new nuggets on how IBM enhanced Threadmill since POWER7: Firstly, they weighted the runtime allocation of templates during emulation, running templates that find more bugs for 10-100x more clock cycles. Second, they deployed some clever information encoding tricks to assist in debug. For example, for a bug related to dropping a write when multiple cores increment the same memory address, they have each core increment that memory address by a different amount. Then the difference between actual and expected value in that address tells them which core’s write was dropped due to the race. Third, they enhance Threadmill with more tricks to bias randomization to better hit bugs. The original Threadmill paper from POWER7 shares the trick of using the same random seed across multiple cores for memory addresses. This increases the frequency of load/store races. In this POWER9 paper they biased addresses to also align with memory page boundaries, to increase the frequency of cross-page accesses. Lastly, they used AI to help further prioritize templates to hit coverage faster.
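The write-drop encoding trick can be sketched as follows. This is illustrative Python, not IBM's actual code; it assumes each core increments the shared location by a distinct power of two, and at most one write per core is dropped.

```python
# Each core increments the shared counter by a distinct power of two, so
# the deficit between expected and actual value is a bitmask naming
# exactly which cores lost a write in the race.
def diagnose_dropped_writes(actual, n_cores):
    increments = [1 << c for c in range(n_cores)]  # core c adds 2**c
    expected = sum(increments)
    missing = expected - actual
    return [c for c in range(n_cores) if (missing >> c) & 1]

# Example: 4 cores each add once; core 2's write (increment 4) is lost
# in a race, so the counter reads 1 + 2 + 8 = 11.
print(diagnose_dropped_writes(actual=11, n_cores=4))  # [2]
```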

All in, compared to POWER8, 30% more bugs were found in 80% of the time. Decent progress on a very tough problem!

Raúl’s view

State-of-the-art processors, such as the IBM POWER9 processor described in this paper, typically undergo multiple tape-outs. Pre-silicon verification cannot identify all bugs, particularly those related to hard-to-hit software timing issues, very long loops, or deep power states, which are not exposed by the Instruction Set Simulator (ISS). This challenge is exacerbated in multi-core, multi-threaded architectures.

The reviewed paper outlines the validation methodology IBM implemented for the POWER9 processor. In October 2022, we examined the methodology used for validating the POWER7 processor, many aspects of which remain applicable.

The approach involves using a bare-metal, self-contained exerciser called Threadmill, which generates sequences of instructions based on templates. These sequences are executed pre-silicon on the ISS and within a highly instrumented Exerciser on Accelerator (EoA) environment. Root cause analysis is considered complete only when a bug is reproduced across Simulation, EoA, and post-silicon Lab. The paper details numerous practical aspects of this process. For example, when the bug rate declines, hardware irritators are employed to induce new bugs, such as by artificially reducing cache sizes and queue depths. Templates with high RTL coverage that uncover numerous bugs are executed 10 to 100 times longer than usual on the accelerator.

IBM’s overall validation methodology has been improving, with results for POWER9 validation compared to POWER8 showing an increase in bugs found in EoA from 1% to 6%, in post-silicon from 1% to 4%, and a reduction in the days required to root cause 90% of the bugs from 31 to 17.

There are open-source instruction generators for RISC-V available on GitHub. The RISC-V DV (Design Verification) framework, maintained by CHIPS Alliance, is an open-source tool for verifying RISC-V processor cores. FORCE-RISC-V, an instruction sequence generator for the RISC-V instruction set architecture from Futurewei, supports multi-core and multi-threaded instruction generation.

Overall, the paper provides valuable insights, especially for practitioners involved in processor validation.

Also Read:

Embracing the Chiplet Journey: The Shift to Chiplet-Based Architectures

2024 Retrospective. Innovation in Verification

Accelerating Automotive SoC Design with Chiplets

Accelerating Simulation. Innovation in Verification


2025 Outlook with Dr. Rui Tang of MSquare Technology

by Daniel Nenni on 02-25-2025 at 10:00 am


Tell us a little bit about yourself and your company. 

I am Rui Tang, co-founder and VP of MSquare Technology. With a Ph.D. in Computer Engineering from Northeastern University and a master’s degree in management science and engineering from Stanford University, I bring over 23 years of experience in the IC industry. Prior to MSquare, I served as Chief Strategy Officer and General Manager at FuriosaAI, Investment Director at BOE Venture, Staff Engineer at Apple, and Principal Engineer at Oracle.

Founded in 2021, MSquare Technology is a leading provider of integrated circuit IPs and Chiplet solutions, committed to addressing challenges in chip interconnectivity and vertical integration in the smart economy era. With offices in Shanghai, Taipei, Sydney, and San Jose, and a team of over 150 employees—80% dedicated to research and development—we aim to foster an open ecosystem platform for AI and Chiplets. Our mission is to empower innovation and growth across the IP and Chiplet industry, leveraging our extensive, cutting-edge, and cost-effective IP library to meet the diverse needs of our global clients.

What was the most exciting high point of 2024 for your company?

In June 2024, MSquare Technology reached a significant milestone with the readiness of our Chiplet product, the ML100 IO Die, designed to bridge UCIe and HBM (High Bandwidth Memory). This year, we achieved successful commercial adoption of the ML100 IO Die by two leading clients, marking a major step in the product’s market validation and implementation.

As a high-bandwidth memory solution, the ML100 IO Die integrates efficient Die-to-Die interconnect IP and supports the UCIe 1.1 protocol. It delivers a maximum bandwidth of 819.2GB/s and supports data transfer rates of 6400 Mbps. This product meets the stringent requirements of AI applications for high bandwidth and low power consumption, enabling ultra-low latency interconnects between chips and high-speed data transfer. These capabilities greatly enhance the efficiency of AI model training and inference, contributing to the continued advancement of AI technologies.
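The two quoted figures are consistent under a simple assumption. The 1024-bit aggregate width below is an assumption for illustration; the interview does not state the lane count.

```python
# Back-of-the-envelope check of the quoted ML100 figures: a 6400 Mbps
# per-pin rate times an assumed 1024-bit width gives the 819.2 GB/s peak.
rate_mbps_per_pin = 6400                 # quoted per-pin data rate, Mbps
width_bits = 1024                        # assumed aggregate data width
total_gbps = rate_mbps_per_pin * width_bits / 1000
total_gb_per_s = total_gbps / 8          # bits per second -> bytes per second
print(total_gb_per_s)  # 819.2
```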

In addition to the success of ML100 IO Die, we’ve expanded our product portfolio with multiple new offerings in 2024, further strengthening our position in the semiconductor industry. Our IP products have been validated by the top 5 foundries, covering over 400 process nodes from 5nm to 180nm.

What was the biggest challenge your company faced in 2024?

In 2024, the semiconductor industry faced a confluence of macroeconomic fluctuations, geopolitical tensions, technological evolution, and tightened capital flows. Many chip startups struggled to survive and were eventually eliminated in the fierce competition. However, challenges often come hand-in-hand with opportunities.

From a technical perspective, MSquare encountered unique challenges in advancing Chiplet architecture, which is regarded as the future trend of the semiconductor industry. While Chiplet technology offers the potential for higher computational performance and resource integration, it also brings several critical hurdles:

  1. Complexity in Heterogeneous Architecture Disaggregation: Breaking down GPU or ASIC designs into Chiplets poses significant technical bottlenecks, as avoiding increases in area and cost while optimizing data transfer performance remains challenging.
  2. Interconnect Standards and Compatibility Issues: Although the release of the UCIe standard provides a foundation for the Chiplet ecosystem, the actual implementation is hindered by compatibility issues between different vendors’ IPs. This increases the ecosystem’s closed nature and complicates product development.

These industry-wide technical challenges demand robust innovation capabilities and high execution efficiency from companies to navigate successfully.

How is your company’s work addressing this biggest challenge? 

To address these challenges, MSquare has successfully turned obstacles into opportunities for growth through three key strategies: technological breakthroughs, ecosystem collaboration, and resource focus:

  1. Technological Innovation and Product Breakthroughs
    • In Chiplet architecture design, MSquare introduced its self-developed ML100 IO Die, which optimizes interconnect efficiency and power consumption during data processing. This innovation effectively overcomes bandwidth bottlenecks in multi-Chiplet collaboration.
    • To address memory bandwidth limitations and constraints in main chip design, we developed the groundbreaking M2LINK solution. This solution converts the HBM protocol to the UCIe protocol and integrates it into a standard module using RDL interposer packaging, while enabling seamless integration with the main SoC. This breakthrough delivers several technical and commercial advantages:
      • Reduces main chip and packaging costs
      • Enhances memory capacity and bandwidth performance
      • Increases design flexibility and shortens product development cycles
      • Ensures stability for high-performance computing and AI applications
  2. Ecosystem Collaboration and Standards Advocacy
    • MSquare actively participates in industry collaborations to promote the adoption of UCIe standards, improving IP compatibility and openness within the Chiplet ecosystem. Through deep cooperation with customers and industry partners, we have optimized Chiplet design and production processes, lowering the technical barriers for customers adopting MSquare’s technologies.
  3. Strategic Focus and Resource Optimization
    • MSquare concentrates on fast-growing market segments such as AI, high-performance computing (HPC), and data centers, channeling resources into breakthrough innovations in these areas.
    • By avoiding the inefficiencies of blind expansion, our resource allocation strategy has significantly enhanced R&D efficiency and customer delivery capabilities, further strengthening our market competitiveness.

What do you think the biggest growth area for 2025 will be, and why?

Looking ahead to 2025, the biggest growth area is expected to be the rapid expansion of AI and data centers, as AI-powered systems and devices continue to transform industries. This growth is driven by increasing demand for chips that address three core challenges in AI development: memory capacity, interconnect bandwidth, and computing performance.

As AI workloads grow more complex, there will be a surge in demand for innovative semiconductor solutions capable of meeting these evolving requirements. Additionally, the growth of AI is triggering significant changes in data center architectures, creating new opportunities for IP and Chiplet suppliers to innovate. One notable trend is the shift from traditional LPO/NPO approaches to Co-Packaged Optics (CPO) technology, which enables higher bandwidth and power efficiency. Furthermore, Optical Chiplets are gaining traction, particularly in their use within switches and in facilitating communication between XPUs and memory.

This transformative wave is reshaping the semiconductor landscape and presenting unprecedented opportunities for companies at the forefront of AI and Chiplet innovation.

How is your company’s work addressing this growth? 

MSquare is addressing the rapidly growing demands in the AI and data center industries with innovative Chiplet solutions, such as our ML100 IO Die. This product offers two flexible configurations: UCIe + HBM3 and UCIe + LPDDR. The UCIe + HBM3 solution decouples HBM from the SoC, reducing the influence of SoC temperature on HBM performance and ensuring compatibility with HBM3 chips for flexible placement. The UCIe + LPDDR solution decouples the Memory PHY from the SoC, increasing memory capacity and offering integration options that accelerate product upgrades.

The ML100 IO Die is built on the UCIe 1.1 Specification, enabling ultra-high bandwidth of up to 1 TB/s with ultra-low latency. This architecture ensures seamless communication between dies, while the integrated HBM3 IP supports transfer rates of up to 6400 Mbps. These technologies provide high flexibility and scalability, empowering customers to overcome performance bottlenecks in AI workloads and data center applications.

To further support the adoption of Chiplet designs, MSquare is continuously developing high-speed interface IPs, including UCIe, HBM, and LPDDR, while collaborating with ecosystem partners in packaging and manufacturing. This approach not only ensures seamless integration but also accelerates solution implementation for customers. With these advancements, MSquare is well-positioned to lead the industry and drive innovation in next-generation AI and data center technologies.

What conferences did you attend in 2024 and how was the traffic?

In 2024, we participated in several prominent conferences across the globe, including:

  • IIC 2024, Shanghai
  • AI Hardware & Edge AI Summit 2024, San Jose
  • SemiBAY, Shenzhen
  • EE Tech Summit, Taipei
  • ICCAD-Expo 2024, Shanghai

Traffic at these events varied. ICCAD-Expo 2024, for example, exceeded expectations with over 6,700 attendees, highlighting the strong interest in advanced chip design. Similarly, AI-focused events like the AI Hardware & Edge AI Summit saw significant year-on-year growth in attendance, reflecting the increasing momentum in artificial intelligence and edge computing sectors.

On the other hand, events such as SemiBAY and EE Tech Summit had moderate but steady traffic, providing valuable opportunities for targeted networking and discussions in more traditional semiconductor and electronics domains.

Overall, the increased attendance at AI and chip design conferences aligns with the industry’s shift toward intelligent computing and advanced architectures, reinforcing our focus on these high-growth areas.

Will you attend conferences in 2025? Same or more?

Yes, we plan to attend conferences in 2025, and the number will likely increase compared to 2024. Based on the positive outcomes and valuable connections we gained from this year’s events, we aim to expand our presence at both existing and new conferences.

In particular, we will focus on attending events in high-growth areas such as artificial intelligence and Chiplet, aligning with industry trends and our strategic priorities. AI-related conferences, which have seen a rise in attendance and engagement, will remain a key focus for us.

Additionally, we are exploring opportunities to participate in more global forums to strengthen our international visibility. By attending a broader range of conferences, we aim to stay at the forefront of emerging technologies and continue fostering partnerships that drive innovation.

How do customers engage with your company?

Customers engage with us through multiple channels, ensuring a seamless and collaborative experience throughout their journey:

  1. Direct Collaboration:
    We work closely with customers through direct consultations to understand their needs, provide tailored IP solutions, and assist in their chiplet design and integration efforts. Our technical team actively supports their development process to ensure smooth implementation.
  2. Conferences and Industry Events:
    Conferences such as IIC, ICCAD-Expo, and AI Hardware Edge AI Summit serve as key touchpoints where customers can engage with us in person. These events allow us to showcase our latest products, demonstrate use cases, and discuss collaborative opportunities face-to-face.
  3. Digital Channels:
    Many customers reach out to us through our website, online webinars, and social media platforms. These channels provide accessible ways for customers to learn about our offerings, request product demos, and initiate discussions.
  4. Long-Term Partnerships:
    For our strategic partners, engagement is often a continuous, multi-phase process. We work as an extension of their teams, offering co-development opportunities and customized solutions aligned with their roadmaps.

By offering diverse engagement methods, we ensure that customers have the flexibility to interact with us in ways that best suit their needs, ultimately building strong, long-term relationships.

Additional questions or final comments? 

As we look ahead to 2025, we’re excited about the continued opportunities in Chiplet, AI, and high-performance computing. Our team is focused on pushing the boundaries of innovation while maintaining strong customer relationships. If there are any specific questions regarding our products or upcoming developments, we’d be happy to discuss them further.

We appreciate your interest in our company, and we look forward to exploring new ways to collaborate and contribute to the advancement of the semiconductor industry in the years to come. Feel free to reach out anytime!

Also Read:

How Synopsys Enables Gen AI on the Edge

Harnessing Modular Vector Processing for Scalable, Power-Efficient AI Acceleration

2025 Outlook with Paul Wells of sureCore


Synopsys Expands Hardware-Assisted Verification Portfolio to Address Growing Chip Complexity

by Kalar Rajendiran on 02-25-2025 at 6:00 am

Synopsys HAV Product Family

Last week, Synopsys announced an expansion of their Hardware-Assisted Verification (HAV) portfolio to accelerate semiconductor design innovations. These advancements are designed to meet the increasing demands of semiconductor complexity, enabling faster and more efficient verification across software and hardware domains.

At the pre-launch event, an insightful fireside chat took place between Ravi Subramanian, chief product management officer at Synopsys, and Salil Raje, senior vice president and general manager, Adaptive and Embedded Computing Group at AMD. As they were not seated next to a fireplace, Ravi joked that a fire was burning in a fireplace somewhere. Having listened to this chat session and talks from Nvidia’s Narendra Konda, vice president, Hardware Engineering, and Arm’s Kevork Kechichian, executive vice president, Solutions Engineering, I’d say Synopsys is on fire and firing on all cylinders. AMD, Arm and Nvidia are a few among a much longer list of top-tier customers who have already used the products announced last week.

The features, functionalities and specifications of the announced products are one thing to review. What is even more interesting is understanding why top-tier customers, who were given early access well before the launch, opted to run their key projects on these new solutions. What customer challenges are addressed by the newly announced products from Synopsys? To uncover that, I chatted with Synopsys’ Tom De Schutter, SVP of Product Management and System Solutions, and Frank Schirrmeister, Executive Director, Product Management and Systems Solutions.

Compounding Complexities Drive Advances in HAV

The rapid evolution of semiconductor technology is driven by four compounding complexities: software, hardware, interface, and architecture. As modern chips integrate billions of transistors, support heterogeneous computing, and execute increasingly software-defined workloads, ensuring their functionality has never been more challenging.

Software Complexity: Modern systems run sophisticated software stacks, requiring early software validation alongside hardware verification. HAV platforms enable pre-silicon software bring-up using emulation and FPGA-based prototyping.

Hardware Complexity: As chip architectures scale with more cores, accelerators, and advanced power management techniques, HAV provides critical functional verification and debugging to catch hardware flaws early.

Interface Complexity: Next-generation devices rely on high-speed interconnects (PCIe, DDR, Ethernet, UCIe, etc). HAV platforms help validate these interfaces under real-world conditions before silicon is produced.

Architectural Complexity: Workload-specific optimizations in AI, automotive, and cloud computing require extensive architectural validation. HAV allows engineers to analyze performance, debug bottlenecks, and refine power efficiency.

The interplay of software, hardware, interface, and architectural complexity makes HAV not just beneficial, but essential. Companies that fail to account for this compounding effect risk costly post-silicon bugs, longer time-to-market, and performance inefficiencies. HAV solutions provide a scalable way to tackle these challenges, ensuring robust and efficient semiconductor designs.

Verification happens in multiple phases, both pre-silicon and post-silicon, to ensure that designs function correctly before and after fabrication. With semiconductor innovations pushing boundaries, HAV ensures functional correctness, accelerates software development, and enables real-world validation, making it an essential pillar in modern chip design, amid growing complexities.

The Four Key Highlights of the Announcement

Synopsys’ HAV portfolio expansion strengthens the ability to manage hardware/software co-design, ensuring faster, more efficient integration and validation of complex systems in industries that depend on high-performance, real-time software.

Next-Generation Hardware Engines

Synopsys has introduced ZeBu-200 for emulation and HAPS-200 for prototyping, delivering the fastest performance in the industry to significantly accelerate verification speed and efficiency. These next-generation hardware engines enable faster debug cycles, higher throughput, and improved system validation, making them essential for today’s complex semiconductor designs.

The HAPS-200 system delivers up to twice the performance and quadruples the debug bandwidth compared to its predecessor, the HAPS-100. It supports configurations ranging from single FPGA setups to multi-rack systems, accommodating designs up to 10.8 billion gates.

The ZeBu-200 system delivers up to twice the performance and up to 8X better debug bandwidth compared to its predecessor, the ZeBu-EP. It can accommodate designs up to 15.4 billion gates.

EP-Ready Expansion for Flexibility

Building on the success of the ZeBu-EP platform, Synopsys has now expanded EP-Ready configurability to both ZeBu (emulation) and HAPS (prototyping) platforms. This enhanced flexibility allows users to dynamically configure compute and connectivity resources, optimizing their investment across different verification workflows while seamlessly transitioning between emulation and prototyping.

Scalability for Massive Designs

With the extension of Modular HAV methodology to ZeBu Server 5, Synopsys now supports designs exceeding 60 billion gates, addressing the verification needs of next-generation, ultra-large semiconductor architectures. This expanded capacity enables efficient pre-silicon validation of complex chips, including multi-die and high-performance computing (HPC) systems, ensuring faster time-to-market.

Virtualization for Automotive and System Software

By combining virtual models running on a host server connected to a HAV system, software bring-up processes can be accelerated. Synopsys has integrated its HAV portfolio with Synopsys Virtualizer, creating this hybrid verification environment. This approach is especially critical for automotive software, where real-time system validation and high-performance execution are essential. By leveraging virtualization, engineers can develop, test, and optimize system software earlier in the design cycle, reducing risks and accelerating deployment.

EP-Ready Hardware Platforms, A Key Differentiator

EP-Ready stands for Emulation-Prototyping-Ready and is a key differentiator in Synopsys’ latest HAV portfolio expansion. The EP-Ready Hardware serves as the base module for compute and connectivity cabling for both prototyping and emulation. Initially introduced with the ZeBu-EP platform in 2022, this flexibility and configurability received strong positive feedback from customers. Now, Synopsys has extended EP-Ready Hardware capabilities to both HAPS-200 (prototyping) and ZeBu-200 (emulation) platforms, allowing users to seamlessly configure their verification environment based on workload needs. This unified approach optimizes performance, scalability, and return on investment (ROI) by providing a streamlined hardware foundation for multiple verification use cases. Customers do not have to decide on the balance of emulation and prototyping hardware upfront.

Summary

Synopsys has announced a significant expansion of its Hardware-Assisted Verification (HAV) portfolio, introducing new solutions to accelerate chip development and system validation. By expanding its HAV portfolio, the company continues to lead the way in high-performance verification, empowering semiconductor designers with scalable, configurable, and software-enabled solutions to drive next-generation chip innovation.

Video: Unveiling Next-Gen HAPS Prototyping & ZeBu Emulation Hardware-Assisted Verification Solutions

For more information and to access the HAPS-200 and ZeBu-200 spec sheets, visit the emulation and prototyping solutions page here.

For comprehensive Systems Verification and Validation solutions, visit here.

Also Read:

How Synopsys Enables Gen AI on the Edge

What is Different About Synopsys’ Comprehensive, Scalable Solution for Fast Heterogeneous Integration

Will 50% of New High Performance Computing (HPC) Chip Designs be Multi-Die in 2025?


How Synopsys Enables Gen AI on the Edge

by Mike Gianfagna on 02-24-2025 at 10:00 am


Artificial intelligence and machine learning have undergone incredible changes over the past decade or so. We’ve witnessed the rise of convolutional neural networks and recurrent neural networks and, more recently, generative AI and transformers. At every step, accuracy has improved, as depicted in the graphic above. These enhancements have also increased the power and “footprint” of AI as the technology moved from automated analysis to automated creation. And all the while, the thirst for processing power and massive data manipulation has continued to grow exponentially.

The broader footprint of Gen AI has also created a shift in the host environment. While the cloud offers tremendous processing power, it also creates serious challenges with latency and data privacy. The edge has emerged as a promising place to address these challenges, but the edge can’t always offer the same resources as the cloud. And power can be a scarce resource. Synopsys is developing IP and software to tame these problems. Let’s examine the challenges of Gen AI and how Synopsys enables Gen AI on the edge.

Why Edge-Based Gen AI?

Transformers have enabled Gen AI, which leverages transformer models to generate new data, such as text, images, or even music, based on learned patterns. The ability of transformers to understand and generate complex data has made them the backbone of AI applications such as ChatGPT. These models demand incredible processing power and massive data manipulation. While the cloud offers all of these capabilities, it is not the ideal place to run these technologies.

One of the reasons for this is latency. Applications like autonomous driving, real-time translation, and voice assistants require instantaneous responses, which can be hindered by the latency associated with cloud-based processing. Privacy and security also come into play. Sending sensitive data to the cloud for processing introduces risks related to data breaches. By keeping data processing local to the device, privacy can be enhanced and the potential for security vulnerabilities can be reduced.

Limited connectivity is another factor. In remote or underserved areas with unreliable internet access, Gen AI enabled edge devices can operate independently of cloud connectivity, ensuring continuous functionality. This is crucial for applications like disaster response, where reliable communication infrastructure is likely compromised.

The Challenges Posed by Gen AI on the Edge

As they say, there is no free lunch. This certainly applies to Gen AI on the edge. It solves a lot of problems but poses many as well. The computational complexity of Gen AI models creates a lot of the challenges. And transformers, which are the backbone of Gen AI models, contribute to this issue due to their attention mechanisms and extensive matrix multiplications. These operations require significant processing power and memory, which can strain the limited computational resources available on edge devices.

Also, edge devices often need to perform real-time processing, especially in applications like autonomous driving or real-time translation. The associated high computational demands can make it difficult to meet speed and responsiveness requirements. The figure below illustrates the incredible rise in complexity presented by large language models.
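A rough back-of-envelope model shows why the complexity curve is so steep: the two large matrix products in self-attention cost on the order of 4·n²·d floating-point operations, so cost grows quadratically with context length. This sketch deliberately ignores the linear projections and softmax, so it is a floor, not an exact count:

```python
def attention_flops(seq_len, d_model):
    # Q*K^T and weights*V each cost ~2 * n^2 * d FLOPs (one multiply
    # plus one add per term); projections and softmax are ignored
    return 4 * seq_len ** 2 * d_model
```

Doubling the context length quadruples the attention cost, a scaling law that hits resource-constrained edge devices hardest.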

LLM Model Complexity Trend

Taming the Challenges

Choosing the right embedded processor for running Gen AI on edge devices is key to overcoming these challenges. Computational power, energy efficiency, and the flexibility to handle various AI workloads must all be considered. Let’s look at some of the options available.

  • GPUs and CPUs deliver flexibility and programmability. This makes them suitable for a wide range of AI applications. They may not be the most power-efficient options for the edge, however. For example, GPUs can consume substantial power, presenting a challenge for battery-operated environments.
  • ASICs deliver a hardwired, targeted solution that works well for specific tasks. However, this approach lacks flexibility. Evolving AI models and workloads cannot be easily accommodated.
  • Neural Processing Units (NPUs) strike a balance between flexibility and efficiency. NPUs are designed specifically for AI workloads, offering optimized performance for tasks like matrix multiplications and tensor operations, which are essential for running Gen AI models. This approach provides a programmable, power-efficient solution for the edge.

Synopsys ARC NPX NPU IP is designed specifically for AI workloads, creating an effective way to enable Gen AI on the edge. For instance, running a Gen AI model like Stable Diffusion on an NPU can consume as little as 2 watts, compared to 200 watts on a GPU. NPUs also support advanced features like mixed-precision arithmetic and memory bandwidth optimization, which are essential for handling the computational demands of Gen AI models. The figure below summarizes the relative benefits of NPUs compared to other approaches for Gen AI on the edge.
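In practice, mixed-precision arithmetic usually means quantizing weights and activations to narrow integers. A minimal sketch of symmetric int8 quantization, illustrative only and not Synopsys’ implementation, looks like this:

```python
def quantize_int8(values):
    # symmetric quantization: map the largest magnitude to 127
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # guard zero input
    return [max(-127, min(127, round(v / scale))) for v in values], scale

def dequantize(quantized, scale):
    # recover approximate floats; error is bounded by about scale / 2
    return [q * scale for q in quantized]
```

Narrow integers cut both the multiply-accumulate energy and the memory traffic per value, which is exactly where edge power budgets are won or lost.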

The Relative Benefits of NPUs for Gen AI on the Edge

Beyond the benefits of its ARC NPX NPU IP, Synopsys also provides high-productivity software tools to accelerate application development. The ARC MetaWare MX Development Toolkit includes compilers and a debugger, a neural network software development kit (SDK), virtual platform SDKs, runtimes and libraries, and advanced simulation models. 

Designers can automatically partition algorithms across MAC resources for highly efficient processing. For safety-critical automotive applications, the MetaWare MX Development Toolkit for Safety includes a safety manual and a safety guide to help developers meet the ISO 26262 requirements and prepare for ISO 26262 compliance testing.
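The partitioning idea can be illustrated with a toy row-splitting scheme. The function and strategy here are assumptions for illustration, not the actual MetaWare partitioning algorithm:

```python
def partition_rows(total_rows, num_units):
    # statically split a matmul's output rows across MAC units,
    # giving each unit a ceiling-sized contiguous chunk
    chunk = -(-total_rows // num_units)  # ceiling division
    return [(u * chunk, min(total_rows, (u + 1) * chunk))
            for u in range(num_units) if u * chunk < total_rows]
```

Each unit then computes its own slice of output rows independently, so the units can run in parallel with no synchronization until the results are gathered.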

To Learn More

If Gen AI is finding its way into your product plans, you will soon face the challenges associated with edge-based implementation. The good news is that Synopsys offers world-class IP and development tools that reduce those challenges. There are many good resources that will help you explore these solutions.

  • An informative article entitled The Rise of Generative AI on the Edge is available here.
  • A short but very informative video from the Embedded Vision Summit is also available. Gordon Cooper, product marketing manager for ARC AI processor IP at Synopsys provides many useful details about the IP and some unique applications. You can view the video here.
  • If you want to explore computer vision and image processing, there is an excellent blog on these topics here.
  • And you can visit the Synopsys website to learn about the complete Synopsys ARC NPX NPU IP here.

And that’s how Synopsys enables Gen AI on the edge.


Harnessing Modular Vector Processing for Scalable, Power-Efficient AI Acceleration

Harnessing Modular Vector Processing for Scalable, Power-Efficient AI Acceleration
by Jonah McLeod on 02-24-2025 at 6:00 am


The dominance of GPUs in AI workloads has long been driven by their ability to handle massive parallelism, but this advantage comes at the cost of high-power consumption and architectural rigidity. A new approach, leveraging a chiplet-based RISC-V vector processor, offers an alternative that balances performance, efficiency, and flexibility, steering towards heterogeneous computing to align with AI/ML-driven workloads. By rethinking vector computation and memory bandwidth management, a scalable AI accelerator could rival NVIDIA’s GPUs in both cloud and edge computing.

A modular chiplet architecture is essential for achieving scalability and efficiency. Instead of a monolithic GPU design, a system composed of specialized chiplets can optimize different aspects of AI computation. A vector processing chiplet, built on the RISC-V vector extension, serves as the primary computational unit, dynamically adjusting vector length to fit workloads of varying complexity. A matrix multiplication accelerator complements this unit, handling the computationally intensive operations found in neural networks.
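The RISC-V vector extension’s dynamic vector length is typically exploited with a strip-mining loop: software requests up to VLMAX elements per pass and the hardware grants what remains. A small Python simulation of the pattern follows; real code would use `vsetvli` and vector instructions rather than a list slice:

```python
def strip_mine(data, vlmax):
    # vector-length-agnostic loop: each iteration asks for up to
    # vlmax lanes and the simulated hardware grants what is left
    out, i = [], 0
    while i < len(data):
        vl = min(vlmax, len(data) - i)
        out.extend(x * 2 for x in data[i:i + vl])  # one vector op on vl lanes
        i += vl
    return out
```

Because the loop never assumes a fixed lane count, the same binary runs correctly on implementations with different vector register widths.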

Alongside the matrix multiplication accelerator, specialized chiplets for tensor operations, cryptography, and other AI/ML functions further enhance efficiency and performance. To address the memory bottlenecks that often slow down AI inference and training, high-bandwidth on-package memory chiplets integrate closely with the compute units, reducing latency and improving data flow. Managing these interactions, a scalar processor chiplet oversees execution scheduling, memory allocation, and communication across the entire system.

One of the fundamental challenges in AI acceleration is mitigating instruction stalls caused by memory latency. Traditional GPUs rely on speculative execution and complex replay mechanisms to handle these delays, but a chiplet-based RISC-V vector processor could take a different approach by implementing time-based execution scheduling. Instructions are pre-scheduled into execution slots, eliminating the need for register renaming and reducing overhead. An execution-time freezing mechanism pauses execution only during stalled loads, keeping the vector units fully utilized without speculation. This ‘Fire and Forget’ time-based scheduling enables parallelism, low power, and minimal overhead while maximizing resource utilization and hiding memory latency.
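The concept can be illustrated with a toy static scheduler that places each instruction in the first slot where its source operands are known to be ready. This is a sketch of the idea only, not the patented mechanism:

```python
def schedule(instructions):
    # instructions: (name, dest_reg, source_regs, latency_in_cycles)
    ready = {}   # register -> cycle when its value becomes available
    issue = 0    # next free issue slot
    plan = []
    for name, dst, srcs, lat in instructions:
        # start in the first slot after issue where all sources are ready
        start = max([issue] + [ready.get(r, 0) for r in srcs])
        plan.append((name, start))
        ready[dst] = start + lat
        issue = start + 1
    return plan

program = [
    ("vload", "v0", [], 4),       # long-latency memory load
    ("vadd",  "v1", ["v0"], 1),   # placed in the slot where v0 is ready
    ("vmul",  "v2", [], 1),       # independent work issued right after
]
```

Because every start slot is computed before execution begins, there is nothing to replay when a latency is hit: the dependent instruction simply was never scheduled earlier than its data.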

Chiplet communication plays a pivotal role in determining overall system performance. Unlike monolithic GPUs that rely on internal bus architectures, a chiplet-based AI accelerator needs a high-speed interconnect to maintain seamless data transfer. The adoption of UCIe (Universal Chiplet Interconnect Express) could provide an efficient die-to-die communication framework, reducing latency between compute and memory units. An optimized network-on-chip (NoC) further ensures that vector instructions and matrix operations flow efficiently between chiplets, preventing bottlenecks in high-throughput AI workloads.

Competing with NVIDIA’s ecosystem requires more than raw throughput; the architecture must deliver measurable efficiency advantages. Higher vector unit utilization keeps the vector pipeline fully active, maximizing throughput and eliminating idle cycles. Fewer stalls and pipeline flushes prevent execution misalignment, ensuring smooth and efficient instruction flow. Superior power efficiency comes from pausing execution only when needed, and optimized instruction scheduling aligns vector execution precisely with data availability, boosting overall performance.

Software plays an equally important role in adoption and usability. A robust compiler stack optimized for RISC-V vector and matrix extensions ensures that AI models can take full advantage of the hardware. Custom libraries tailored for deep learning frameworks such as PyTorch and TensorFlow bridge the gap between application developers and hardware acceleration. A transpilation layer such as CuPBoP (CUDA for Parallelized and Broad-range Processors) enables seamless migration from existing GPU-centric AI infrastructure, lowering the barrier to adoption.

CuPBoP presents a compelling pathway for enabling CUDA workloads on non-NVIDIA architectures. By supporting multiple Instruction Set Architectures (ISAs), including RISC-V, CuPBoP enhances cross-platform flexibility, allowing AI developers to execute CUDA programs without the need for intermediate portable programming languages. Its high CUDA feature coverage makes it a robust alternative to existing transpilation frameworks, ensuring greater compatibility with CUDA-optimized AI workloads. By leveraging CuPBoP, RISC-V developers could bridge the gap between CUDA-native applications and high-performance RISC-V architectures, offering an efficient, open-source alternative to proprietary GPU solutions.

Energy efficiency is another area where a chiplet-based RISC-V accelerator can differentiate itself from power-hungry GPUs. Fine-grained power gating allows inactive compute units to be dynamically powered down, reducing overall energy consumption. Near-memory computing further enhances efficiency by placing computation as close as possible to data storage, minimizing costly data movement. Optimized vector register extensions ensure that AI workloads make the most efficient use of available compute resources, further improving performance-per-watt compared to traditional GPU designs.
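The payoff of fine-grained power gating is easy to see with a toy per-cycle power model. The milliwatt figures below are invented for illustration only:

```python
# assumed per-state power costs (mW); not measured silicon numbers
POWER_MW = {"active": 100, "idle": 30, "off": 1}

def energy_mw_cycles(trace):
    # sum per-cycle power for one compute unit's state trace,
    # where 'idle' is clock-gated and 'off' is power-gated
    return sum(POWER_MW[state] for state in trace)
```

Under these assumed costs, power-gating a unit for two of three cycles cuts its energy by roughly two thirds compared to leaving it merely clock-gated, which is the performance-per-watt lever the paragraph describes.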

Interestingly, while the idea of a RISC-V chiplet-based AI accelerator remains largely unexplored in public discourse, there are signals that the industry is moving in this direction. Companies such as Meta, Google, Intel, and Apple have all made significant investments in RISC-V technology, particularly in AI inference and vector computing. However, most known RISC-V AI solutions, such as those from SiFive, Andes Technology, and Tenstorrent, still rely on monolithic SoCs or multi-core architectures, rather than a truly scalable, chiplet-based approach.

A recent pitch deck from Simplex Micro suggests that a time-based execution model and modular vector processing architecture could dramatically improve AI processing efficiency, particularly in high-performance AI inference workloads. While details on commercial implementations remain sparse, the underlying patent portfolio and architectural insights indicate that the concept is technically feasible. (see table)

Patent #        Patent Title                                               Granted
US-11829762-B2  Time-Resource Matrix for a Microprocessor                  11/28/2023
US-12001848-B2  Phantom Registers for a Time-Based CPU                     11/12/2024
US-11954491-B2  Multi-Threaded Microprocessor with Time-Based Scheduling   4/9/2024
US-12147812-B2  Out-of-Order Execution for Loop Instructions               11/19/2024
US-12124849-B2  Non-Cacheable Memory Load Prediction                       10/22/2024
US-12169716-B2  Time-Based Scheduling for Extended Instructions            12/17/2024
US-11829767-B2  Time-Aware Register Scoreboard                             11/28/2023
US-11829762-B2  Statically Dispatched Time-Based Execution                 11/28/2023
US-12190116-B2  Optimized Instruction Replay System                        1/7/2025

The strategic positioning of such an AI accelerator depends on the target market. Data centers seeking alternatives to proprietary GPU architectures would benefit from a flexible, high-performance RISC-V-based AI solution. Edge AI applications, such as augmented reality, autonomous systems, and industrial IoT, could leverage the power efficiency of a modular vector processor to run AI workloads locally without relying on cloud-based inference. By offering a scalable, customizable solution that adapts to the needs of different AI applications, a chiplet-based RISC-V vector accelerator has the potential to challenge NVIDIA’s dominance.

As AI workloads continue to evolve, the limitations of traditional monolithic architectures become more apparent. A chiplet-based RISC-V vector processor offers a modular, scalable, power-efficient, and cost-effective alternative within an open-source ecosystem, well suited to AI, ML, and HPC. By integrating time-based execution, high-bandwidth interconnects, and workload-specific optimizations, this emerging architecture could pave the way for the next generation of AI hardware, redefining the balance between performance, power, and scalability.

Also Read:

Webinar: Unlocking Next-Generation Performance for CNNs on RISC-V CPUs

An Open-Source Approach to Developing a RISC-V Chip with XiangShan and Mulan PSL v2

2025 Outlook with Volker Politz of Semidynamics