
Agentic AI and the Future of Chip Design: From Productivity Tool to Engineering Partner

by Kalar Rajendiran on 06-15-2026 at 6:00 am

[Image: ESDA panel session, June 10, 2026]

Highlights from a recent panel session moderated by Ed Sperling (Semiconductor Engineering) featuring Walden Rhines (Silvaco), Vincent Wong (Verific), Dave Kelf (Breker Verification Systems), Shelly Henry (MooresLab AI), Ann Wu (Silimate), and Cindy Cui (ChipAgents). The panel session was hosted by the Electronic System Design Alliance (ESDA), a SEMI Technology Community.

The Question Is No Longer Whether AI Will Change Semiconductor Design

For decades, semiconductor engineers have faced the same challenge: growing design complexity combined with shrinking schedules. Every new process node, every new architecture, and every new application has increased the burden on design and verification teams.

At a recent industry panel on Agentic AI and semiconductor development, one message emerged clearly: artificial intelligence is no longer an experimental technology on the periphery of Electronic Design Automation (EDA). It is rapidly becoming a core component of the design and verification workflow.

What remains uncertain is not whether AI will be adopted, but how the industry will establish trust, accountability, and engineering discipline around its use.

From Copilot to Autonomous Engineering Systems

Today’s AI applications in semiconductor design largely function as assistants. They help generate RTL, create verification collateral, write testbenches, automate documentation, and accelerate debugging.

The next phase is agentic AI—systems capable of executing multi-step engineering tasks with minimal human intervention.

Several panelists described a future in which AI agents collaborate across the design flow, from specification and architecture through implementation, verification, physical design, and signoff.

The attraction is obvious. Chip development cycles measured in years could potentially be compressed into months. Engineering teams could explore far larger design spaces than would be practical using traditional methodologies.

But increased automation introduces a fundamental challenge: trust.

Productivity Is Easy. Correctness Is Hard.

One of the recurring themes throughout the discussion was that semiconductor design differs fundamentally from many other AI application domains.

A recommendation engine can occasionally be wrong without significant consequences. A chip cannot.

A single functional bug can result in months of schedule delay and millions of dollars in respin costs. As a result, the industry’s standards for correctness are exceptionally high.

The panel repeatedly returned to a central question:

How do engineers verify that AI-generated outputs are correct?

Generating RTL is relatively straightforward. Proving that RTL correctly implements the specification is much harder.

Generating a testbench is relatively straightforward. Demonstrating that the testbench adequately verifies the design is much harder.

As one panelist noted, speed without correctness is not progress.

The future of agentic AI in semiconductor design will therefore depend as much on validation technologies as on generation technologies.

Human-in-the-Loop Is Not Going Away

Despite excitement around autonomous agents, none of the panelists argued that fully autonomous chip tapeouts are imminent.

Today’s large language models still hallucinate. They can produce outputs that appear plausible while being fundamentally incorrect. This characteristic creates significant risk in engineering applications where subtle errors can remain hidden until late in the development cycle.

Consequently, most panelists advocated a “human-in-the-loop” approach.

Engineers will increasingly supervise AI systems rather than manually perform every task. Critical decisions will still require human review, signoff, and accountability.

The role of the engineer changes from creator to reviewer, architect, and orchestrator.

In that sense, AI does not eliminate engineering judgment—it increases its importance.

The Need for Industry Benchmarks

A particularly important topic was measurement.

How should the industry evaluate AI systems?

Traditional metrics such as code generation speed or productivity improvements are insufficient. Semiconductor companies need objective ways to assess quality, correctness, coverage, and reliability.

Several participants argued that the industry lacks common benchmarks capable of evaluating AI-generated designs and verification environments.

Without trusted benchmarks, organizations risk comparing tools based on marketing claims rather than measurable engineering outcomes.

The development of standardized benchmark suites may become one of the most important enablers of widespread AI adoption in EDA.

Design Space Exploration May Be AI’s Greatest Contribution

While much attention focuses on automation, another theme emerged repeatedly: exploration.

Historically, engineers have been constrained by time and computational resources. Only a limited number of architectural alternatives could be evaluated before schedules forced decisions.

Agentic AI changes that equation.

AI systems can evaluate dramatically larger solution spaces, explore more architectural options, and identify design tradeoffs that would be impractical to examine manually.

Several panelists compared this opportunity to earlier EDA revolutions such as synthesis and high-level synthesis. The true value may not be simple productivity gains but rather the ability to discover better solutions.

In this view, AI’s greatest contribution may be improving quality of results rather than merely reducing engineering effort.

Democratizing Chip Design

Another intriguing possibility is that AI could lower barriers to semiconductor development.

Historically, advanced chip design has required large engineering teams, substantial capital, and extensive infrastructure. As AI automates portions of the workflow, smaller organizations may gain the ability to build increasingly sophisticated devices.

This could create a broader and more diverse semiconductor ecosystem.

Rather than concentrating innovation among a handful of hyperscalers and semiconductor giants, agentic AI may enable startups and specialized companies to develop highly optimized solutions for specific applications.

The result could be an explosion of custom silicon.

What Happens to Engineers?

Perhaps the most important question from the audience concerned workforce impact.

Will AI replace engineers?

The panel’s consensus was nuanced.

Entry-level tasks involving documentation, repetitive coding, collateral generation, and routine verification activities are likely to become increasingly automated.

However, the need for experienced engineers is unlikely to disappear.

Future engineers will require a different skill set:

  • Systems thinking
  • Architectural reasoning
  • Cross-domain understanding
  • AI supervision and orchestration
  • Verification and validation expertise
  • Critical evaluation of AI-generated outputs

In short, engineers will spend less time producing artifacts and more time evaluating, directing, and integrating them.

The concern is not necessarily that engineering jobs disappear. The concern is ensuring that future engineers still develop sufficient expertise to become the architects and technical leaders of tomorrow.

The Road Ahead

The semiconductor industry has experienced several transformative shifts over the past four decades: logic synthesis, hardware description languages, formal verification, system-level design, and advanced process technologies.

Agentic AI may prove to be equally significant.

Yet unlike many previous automation technologies, AI introduces questions that extend beyond productivity. It forces the industry to confront issues of trust, explainability, accountability, workforce development, and engineering methodology.

The panel ultimately reached a broad consensus.

Agentic AI will become an essential component of semiconductor design and verification. The technology is already demonstrating value, and its capabilities continue to improve rapidly.

The challenge now is not building more powerful AI systems.

The challenge is building engineering processes that allow those systems to be trusted.

Also Read:

John Barr: The EDA Veteran and Award-Winning Needham Funds Portfolio Manager

SemiWiki Q&A with Julie Rogers, Executive Director, ESD Alliance

The Name Changes but the Vision Remains the Same – ESD Alliance Through the Years


CEO Interview with Suresh Vasudevan of Clockwork.io

by Daniel Nenni on 06-12-2026 at 6:00 pm


Suresh Vasudevan is CEO of Clockwork.io, pioneering Software-Driven AI Fabrics™ that recover the 60-80% of cluster capacity that today goes completely unutilized. Previously, he led Nimble Storage to IPO and HPE acquisition, and served as CEO of Sysdig. Prior to that, he was at NetApp and McKinsey & Co.

His focus: making every GPU hour count.

Tell us about your company.

Clockwork.io solves a fundamental bottleneck in AI compute – the gap between what organizations pay for and what they get. We deliver a hardware-agnostic layer that provides nanosecond observability, fault tolerance, and performance optimization across any accelerator, network, or deployment model. Our solution TorchPass is benchmarked by SemiAnalysis as the only solution maintaining full throughput during failures.

We were founded in 2018 on Stanford research by Balaji Prabhakar, Yilong Geng, and Chief Scientist Mendel Rosenblum (VMware co-founder), backed by Diane Greene, John Chambers, and Lip-Bu Tan.

What problems are you solving?

The bottleneck in AI isn’t how fast the GPU computes. It’s how well hundreds and thousands of them can work together. The fabric connecting them was never engineered to make that reliable.

AI training uses globally synchronous collective operations where every GPU rank must complete each step before any proceeds. A GPU falling off the bus, a memory XID error, a driver fault, a link flap, a NIC failure, a straggler, or an NCCL hang can crash the entire job. SemiAnalysis research finds the first failure on a new cluster happens within 26 minutes. Meta’s Llama 3 logged 466 interruptions over 54 days, each costing 8–24 engineering hours. In a 2,048-GPU cluster, this equates to $6.0M annually.
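To make the arithmetic behind these figures concrete, here is a minimal Python sketch. It derives the mean time between interruptions from the Llama 3 numbers quoted above, and it shows why synchronous collectives make large jobs fragile: the job keeps running only if every rank survives. The per-GPU failure probability, the independence assumption, and the function names are illustrative assumptions, not Clockwork or SemiAnalysis data.

```python
def mean_hours_between_interruptions(interruptions: int, days: float) -> float:
    """Average wall-clock hours between job interruptions."""
    if interruptions <= 0:
        raise ValueError("interruptions must be positive")
    return days * 24.0 / interruptions


def job_survival_probability(p_gpu_fail: float, n_gpus: int) -> float:
    """Probability that a synchronous job sees no failure in a given window.

    Assumes independent failures with the same per-GPU probability. This is
    a simplification: real failures are often correlated.
    """
    if not 0.0 <= p_gpu_fail <= 1.0:
        raise ValueError("p_gpu_fail must be between 0 and 1")
    if n_gpus < 0:
        raise ValueError("n_gpus must be non-negative")
    # Every rank must survive, so the per-GPU survival odds multiply.
    return (1.0 - p_gpu_fail) ** n_gpus


if __name__ == "__main__":
    # Figures quoted in the interview: 466 interruptions over 54 days.
    mtbi = mean_hours_between_interruptions(466, 54)
    print(f"Mean time between interruptions: {mtbi:.2f} hours")

    # Hypothetical per-GPU failure probability per window (assumption).
    p = 1e-4
    for n in (8, 256, 2048):
        print(f"{n:>5} GPUs: survival probability {job_survival_probability(p, n):.3f}")
```

With the quoted figures, the first function works out to a little under three hours between interruptions. The second shows how quickly the odds of an uninterrupted window fall as the rank count grows, even when each GPU is individually very reliable.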

The result: most organizations convert only 20–25% of paid GPU capacity into useful work, which SemiAnalysis calls “goodput.” The rest is lost to failures and overhead. This is the problem we solve: helping companies maximize their GPU capacity.

What application areas are your strongest?

Clockwork serves AI builders, including hyperscalers, enterprises, and research institutions – as well as GPU cloud operators.

For AI builders, TorchPass handles failures transparently, so teams need to checkpoint less often, enabling larger batch sizes, fewer OOM errors, and faster time to objective. The deeper shift: AI teams care whether their model finishes, not whether individual nodes are up. The meaningful metric isn’t availability percentage. It’s what fraction of failures are resolved without lost training progress.

For GPU cloud operators, Clockwork delivers faster commissioning, stronger SLAs, and zero-downtime maintenance. This means firmware updates while training continues. It enables operators to offer job-level continuity to tenants: committing not to node uptime but to whether training completes without rollback. This is a meaningful differentiator in a commoditizing GPU market.

Customers include Uber, NScale, Nebius, White Fiber, and DCAI (Denmark), which runs the Gefion supercomputer for quantum computing, drug discovery, and weather research.

What keeps your customers up at night?

Job crashes and checkpoint waste. A GPU falling off the bus, GPU stragglers, memory XID errors, driver faults, link flaps, NIC failures, hung ranks, or misconfigurations can crash a multi-week run. Every crash rolls back to the last checkpoint. Frequent saves mean high overhead; infrequent saves mean high rollback risk.
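The checkpoint tradeoff described here has a well-known first-order approximation, the Young/Daly formula, which puts the optimal interval near sqrt(2 × checkpoint cost × mean time between failures). The Python sketch below is not Clockwork's method. It only illustrates the tradeoff, using the 466 interruptions over 54 days quoted earlier in this interview and a hypothetical five-minute checkpoint cost.

```python
import math


def young_daly_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal time between checkpoints: sqrt(2 * C * MTBF).

    Both arguments must use the same time unit; the result uses it too.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of run time lost to saving plus redone work.

    checkpoint_cost / interval : overhead of writing checkpoints
    interval / (2 * mtbf)      : on average, half an interval is redone per failure
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    mtbf_min = 54 * 24 * 60 / 466  # minutes between interruptions, from the quoted figures
    cost_min = 5.0                 # hypothetical time to write one checkpoint (assumption)

    best = young_daly_interval(cost_min, mtbf_min)
    for label, interval in (("half", best / 2), ("optimal", best), ("double", best * 2)):
        waste = wasted_fraction(interval, cost_min, mtbf_min)
        print(f"{label:>8}: checkpoint every {interval:6.1f} min, ~{waste:.1%} of time lost")
```

Saving either more or less often than the computed interval increases the estimated loss, which is the overhead-versus-rollback tension in numeric form.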

Opaque infrastructure. When any of these strikes, root cause requires hours of forensics across networking, compute, and ML teams. This is harder in heterogeneous clusters running RoCE v2, InfiniBand, or new transports like MRC with no unified cross-fabric view. Physical layer failures add to this: mismatched pluggables, firmware drift, and environmental factors produce gray failures invisible to standard checks.

Hardware obsolescence. GPU generations turn fast — Hopper to Blackwell to Vera Rubin — and vendor-specific fabric amplifies re-engineering costs each time.

ROI at scale. At 30% utilization, cost-per-GPU-hour is 3x what it should be. In NVL72 systems, a single NVLink backplane failure takes multiple trays offline with far less observability than scale-out.
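The utilization arithmetic can be written out as a short Python sketch. It assumes the simplest possible model, where the effective cost of a useful GPU-hour is the price paid divided by the fraction of capacity that does useful work. The price used is a hypothetical placeholder, not a quoted rate.

```python
def cost_per_useful_gpu_hour(price_per_gpu_hour: float, utilization: float) -> float:
    """Effective cost of one hour of useful work at a given utilization.

    utilization is the fraction of paid capacity doing useful work, in (0, 1].
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return price_per_gpu_hour / utilization


if __name__ == "__main__":
    price = 2.00  # hypothetical list price per GPU-hour (assumption)
    for utilization in (0.25, 0.30, 0.60, 1.00):
        effective = cost_per_useful_gpu_hour(price, utilization)
        multiple = effective / price
        print(f"utilization {utilization:.0%}: ${effective:.2f} per useful hour ({multiple:.2f}x)")
```

Under this model, 30% utilization gives a multiple of about 3.3, in line with the roughly 3x figure cited above.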

What does the competitive landscape look like and how do you differentiate?

AI fabric management is a nascent category with no direct competitor. The closest analog is bespoke solutions built by frontier labs — SpaceX/xAI, Meta, Google — with massive resources and stacks tuned to their topology.

For everyone else, it’s been poor utilization or patching open-source tools. Clockwork.io is hardware-agnostic across NVIDIA and AMD compute, InfiniBand, Ethernet, and RoCE. Broadcom has stated Clockwork helps their platforms “realize their full potential”; AMD has endorsed Clockwork on the MI350X/ROCm stack. SemiAnalysis found TorchPass “the only option that maintains the same training performance as jobs without fault tolerance.”

Clockwork differentiates on three dimensions: hardware neutrality across TorchTitan, Megatron-LM, DeepSpeed, Slurm, and Kubernetes; nanosecond-precision telemetry via Global ClockSync; and stateful fault tolerance via TorchPass — live GPU migration from the exact failed step. Hardware neutrality hedges against inference fragmentation: as KV cache and prefill/decode add RDMA demands, vendor-specific stacks compound with each workload. Fibre Channel gave way to Ethernet for the same reason.

What new features/technology are you working on?

The roadmap targets “autonomic collective communications” – building a fabric that predicts failures, adapts routing, and self-optimizes. Three new FleetIQ capabilities ship this month: Metric-to-Detection Pipeline, Advanced Fleet Monitoring, and Advanced Workload Monitoring. Advanced Fleet Monitoring probes every NIC-to-switch link 100× per second with directional precision, surfacing gray failures that Round Trip Time (RTT) monitoring averages away. Advanced Workload Monitoring instruments at the NCCL layer to identify which rank stalled. TorchPass is evolving into a full training continuity platform — a tiered orchestration layer selecting the least disruptive response to failures under customer-defined policy. The core pillars — ClockSync, State Transfer, and Dynamic Traffic Control — are advancing from reactive to predictive.

How do customers normally engage with your company?

FleetIQ is 100% software overlay — no hardware changes. Customers begin with a free consultation or POC. SemiAnalysis benchmarking and TCO/Goodput calculators support technical evaluation.

As GPU systems grow denser and costlier, the ROI for software that recovers latent capacity grows with it.

clockwork.io — hello@clockwork.io

Also Read:

Q&A Interview with Mo Steinman, Lightelligence’s Senior Vice President and General Manager, U.S.

CEO Interview with Mike Horton CEO of HYFIX

How llmda.ai Coaxed Me Out of Retirement, an Interview with Kurt Shuler


Q&A Interview with Mo Steinman, Lightelligence’s Senior Vice President and General Manager, U.S.

by Daniel Nenni on 06-12-2026 at 4:00 pm


Maurice (Mo) Steinman is Senior Vice President and General Manager, U.S. at Lightelligence. He took a few minutes out of a busy week to answer questions about Lightelligence, its optical solutions, application areas for its optical computing products and how it differentiates itself in this market.

Tell us about Lightelligence.

Lightelligence is an MIT spin-out founded in 2017. Our vision, quite simply, is to unleash the power of optical computing, fueled by the belief that a new paradigm is required as the computational and data demands of today’s workloads increasingly outpace the capabilities of purely electronic systems. Since our founding, we have assembled a team of over 250 professionals across six locations worldwide working tirelessly to make that vision a reality.

What problems is Lightelligence solving?

Our products are focused on two distinct spaces within the optical landscape: optical computing and optical interconnect fabrics.

Our optical computing products are built around an arithmetic processor that exploits the properties of light to perform vector-matrix multiplications at extremely low latency. This type of processor allows us to address single-node performance by accelerating this critical type of arithmetic operation which is prevalent in combinatorial problems and today’s most computationally-demanding workloads.

Our optical interconnect fabric products allow us to scale our customers’ system-wide performance beyond the single-node, by connecting computing elements at multiple different levels. Our solutions range from chip-to-chip, board-to-board, server-to-server, and rack-to-rack, using our optical interconnect fabric products ranging from pluggable modules to co-designed highly integrated optical engine solutions.

What application areas are Lightelligence’s products best suited for?

Our optical computing products are best suited for applications that benefit from low-latency computation.  In our most recently released optical computing product, PACE2, we have integrated developer-friendly features including a software tool chain to facilitate algorithm exploration. We want to get this product into the hands of researchers who can identify algorithms that exploit the inherent hardware performance for maximum workload benefit.

Our customers are using our fabric products to scale performance across several different use cases.  In one scenario, we are enabling our customers to aggregate the performance of many compute nodes with an optical fabric that can span the physical breadth of the system, typically across multiple racks.  Optical fabrics are well-suited to build systems that are comprised of tightly coupled elements that are physically distant from each other.  Interconnect distances that would be unreachable using purely electronic interconnect are considered “short reach” for optical fabrics.

A related usage scenario involves systems requiring composability.  Such systems are disaggregated into distinct server elements and resource pools that can be scaled and assigned flexibly and efficiently.  In disaggregating the system, interfaces must be exposed and then connected to permit composing the full system.  The ability to manage and share resources across a larger number of servers requires an interconnect fabric that can span a physically large footprint in the data center and connect the exposed interfaces efficiently.  Optical interconnects are perfectly suited for the composable disaggregated infrastructure (CDI) market.

What does the competitive landscape look like and how does Lightelligence differentiate?

The landscape is rich with players vying to address one or more areas of optical technology. Rather than taking a component-level view of the market, all our products are developed with a customer use-case mindset and system-level approach. Our cross-functional team has developed a level of insight into the union of hardware and software for customer workloads that sets us apart from other players who employ a component-level approach. This allows us to engage more deeply with our customers and deliver solutions more smoothly and optimally.

What new features/technology is Lightelligence working on?

We continue to invest in future generations of optical processors, scaling matrix processing performance and increasing customer usability and adoption.  On the interconnect side, novel features such as Distributed Optical Circuit Switch (dOCS) allow our customers to build topology reconfiguration directly into the performance scale-up network.  We are developing Near-Package Optics (NPO) and co-packaged optics (CPO) solutions for deeper levels of optical integration with customer silicon.

What is the biggest growth area for 2026 and why?

In 2026 we are excited about our dOCS (distributed Optical Circuit Switch) technology.  This product neatly combines a switching function directly within the scale-up network in a deployment-friendly pluggable form-factor, which eases customer adoption.  Built-in switching within the fabric enhances system reliability, fault containment, and uptime by eliminating the single point of failure typically seen with a centralized switch function.  We are also looking toward increased adoption in the composable disaggregated infrastructure space with the onset of widely available PCIe 6.0 and CXL 3.0 hardware.

How do customers normally engage with Lightelligence?

Lightelligence utilizes a direct approach to customer engagement. Our business development and sales professionals work to match our technology and product solutions to customer needs. Our engagements range from deep co-development partnerships to customizations of standard products to utilizing products directly from stock. Our field applications engineers (FAEs) work with customers to facilitate trials on their systems and navigate the integration path to deployment.

For more details and updates on Lightelligence, SemiWiki readers should visit www.lightelligence.ai or follow us on LinkedIn. Specific questions should be sent to: sales@lightelligence.ai

About Maurice Steinman

Maurice (Mo) Steinman is a Senior Vice President and General Manager, U.S. at Lightelligence. He has enjoyed a 40-year career in the semiconductor industry, working for such companies as Digital, Compaq, HP, Intel and AMD, where he held the title of Senior Fellow before joining Lightelligence. A veteran of many successful tape outs and product introductions, Steinman has expertise in SoC architecture, SoC interconnect, memory subsystems, and energy management. He graduated from Rensselaer Polytechnic Institute (RPI) with a Bachelor of Science degree in Computer and Systems Engineering and is named as an inventor on over 50 U.S. patents.

Also Read:

CEO Interview with Daniel Schall of Black Semiconductor

CEO Interview with Vivek Vishwakarma of ThirdAI Automation

CEO Interview with Vivek Raghunathan of Xscape Photonics


CEO Interview with Mike Horton CEO of HYFIX

by Daniel Nenni on 06-12-2026 at 2:00 pm


Mike Horton is a co-founder of GEODNET, the world’s largest decentralized GNSS reference network, and HYFIX Spatial Intelligence, which builds GNSS hardware for decentralized positioning and timing networks.

Tell us about your company.
HYFIX builds American-made chips for drones and autonomous robots. We started the company after spending years watching teams piece together autonomous systems from a pile of disconnected hardware. One vendor handled flight control, another handled GPS, another handled radios, and another handled compute. It works until it doesn’t. The entire stack ends up heavier, more power hungry, harder to secure, and harder to maintain than it should be. We’re replacing that patchwork with a single integrated platform designed specifically for autonomy.

What problems are you solving?
Right now, building a drone still feels more like systems integration than product development. Teams spend enormous amounts of time trying to make different components cooperate, and every additional vendor creates another failure point. If a supplier changes a module, disappears, or gets restricted, entire products can stall. That problem has become impossible to ignore as more of the industry depends on foreign-made hardware. We think the stack needs to get dramatically simpler, smaller, and more resilient if autonomous systems are actually going to scale.

What application areas are your strongest?
A lot of our work centers around drones and robotic systems that need reliable positioning and autonomy outside of perfect lab conditions. That includes inspection drones, mapping and surveying systems, agriculture, public safety, ISR platforms, and lightweight consumer drones. The interesting thing is that these markets all run into the same bottlenecks eventually. Power constraints, unreliable positioning, integration complexity, and systems failing in environments that are less forgiving than a demo field.

What keeps your customers up at night?
Supply chain dependence is a major one. A lot of companies realized they were far more exposed than they thought once geopolitical pressure and FCC restrictions started entering the conversation. Reliability is another. Many autonomous systems still rely on loosely connected stacks that were assembled over time rather than designed as one cohesive platform. Then there’s GPS. People used to treat degraded GPS environments as edge cases. They’re not edge cases anymore.

What does the competitive landscape look like and how do you differentiate?
Most of the industry still treats autonomy as a collection of separate subsystems. You source positioning from one company, radios from another, compute from somewhere else, and then your engineering team spends months trying to hold it all together. We took the opposite approach. We started with the assumption that these systems should have been integrated from the beginning. Our chip combines flight control, positioning, communications, and onboard intelligence into one platform, which cuts weight, lowers power consumption, and removes a lot of the integration pain that slows teams down today. Building it in the U.S. also matters much more now than it did even a few years ago.

What new features or technology are you working on?
Right now we’re focused on getting production-ready chips into customer systems and continuing to improve precision and resilience in difficult operating conditions. We’re integrating with GEODNET’s RTK network and exploring how emerging LEO satellite systems can improve reliability when traditional GPS becomes unstable. We’re also building a sub-250g reference drone because we wanted to demonstrate the platform in a real product instead of just talking about it in architecture diagrams.

How do customers normally engage with your company?
Most conversations start when a team gets tired of fighting its hardware stack. Sometimes they’re trying to reduce weight and power consumption. Sometimes they care about secure communications or GPS resilience. Sometimes they’re simply frustrated by how long integration takes. Usually there’s a moment where they realize they’re spending more time managing component complexity than actually building autonomous capabilities. That’s typically where we come in.

Also Read:

CEO Interview with Chuck Gershman of Owl Autonomous Imaging

CEO Interview with Mike Horton of HYFIX Spatial Intelligence

CEO Interview with Daniel Schall of Black Semiconductor


Podcast EP350: The Growing Threat of Hardware Security Breaches and What to do About it with Dr. Andreas Kuehlmann

by Daniel Nenni on 06-12-2026 at 10:00 am

Daniel is joined by Dr. Andreas Kuehlmann, General Manager of Security Solutions at Arteris. He has over 35 years of experience in semiconductor design, software, and cybersecurity, including roles at IBM Research, UC Berkeley, Cadence, and Synopsys. Previously, he was CEO of Cycuity, which was acquired by Arteris.

Dan explores with Andreas the growing impact of AI-fueled cyberattacks on the world, covering both software and hardware attacks. Andreas explains some dynamics of software versus hardware security attacks. Open source design usage is far more prevalent in software than in hardware, which explains some of the differences. However, the world is changing, and hardware level attacks are becoming a larger threat. Andreas describes the situation’s dynamics in some detail. Regulatory influence will also play a role.

He explains that improving hardware security is a journey and offers a few steps to begin. The threats are real, and may arrive sooner than you think. This podcast offers valuable insights on how to prepare.

Arteris Cybersecurity Solutions

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


How llmda.ai Coaxed Me Out of Retirement, an Interview with Kurt Shuler

by Daniel Nenni on 06-12-2026 at 6:00 am


Arteris is one of the most impressive companies SemiWiki has worked with over the last fifteen years. We have collaborated on one hundred and seventy-three articles/podcasts that have garnered more than two million views/listens. The success of Arteris can easily be traced to the executive team, and Kurt Shuler was the executive we interfaced with. It was a pleasure to catch up with Kurt since his retirement and collaborate with him once again on a new venture.

You retired to Spain after a long career in semiconductors and an exciting success at Arteris. Most people stay retired. What happened?

I blame an old friend from my Texas Instruments days.

I was thoroughly enjoying retirement in Valencia, Spain. I was investing in and evaluating startups, building furniture, learning to play bass and speak Spanish. And I have two kids in middle and high school, so I was never bored. Then, out of nowhere I got a message from a friend I’d worked with at TI years ago. The message was, “Hey, you have to look at this company. They’re solving the problem you’ve been complaining about for twenty years.”

My first reaction was skepticism. Right now, everyone with a large language model and a pitch deck claims to be revolutionizing chip design. But this friend from TI knew exactly what problem I’d been complaining about, because he’d heard me complain about it for two decades. So, I decided to take a look.

For context, my path here is a little unusual. I started as an air commando in Air Force Special Operations, then went to MIT Sloan and crossed over into tech. That path included Intel, Texas Instruments, a couple of startups that got acquired (Virtio by Synopsys, Tenison by ARC), and then twelve years leading marketing at Arteris, the network-on-chip IP company, through its IPO. I spent my whole semiconductor career around SoC integration, which means I spent my whole career watching what happens when the pieces don’t match.

One conversation with Nagesh Gupta and Mahesh Umasankar, the founders of llmda.ai, and I knew this wasn’t another “generate RTL faster” company. They were going after the thing that kills schedules. I was compelled to sign on as a strategic advisor.

So, what exactly is this problem you complained about for twenty years?

It’s the late-stage fire drill. Anyone who has shipped a chip knows it.

At Arteris I had a privileged, and slightly cursed, vantage point. The network-on-chip touches every block in the SoC, so when any team’s spec drifted out of sync with reality, the wreckage tended to surface at integration. And when it surfaced, our phone rang.

The classic version goes like this. The customer’s project is humming along. Tape-out is in sight. Then validation trips over something small, say a register definition that the software team read one way and the design team implemented another way. Nobody was careless. The document was correct when it was written. But the design moved and the document didn’t, and now you’re debugging a mismatch in real silicon or scrambling months before tape-out. Schedules in this industry don’t slip by weeks; they slip by quarters.

What made it maddening is that I watched this for twenty years across every company, every geography, every process node. The implementation tools kept getting much better. But the fire drills never went away. Industry data backs up the gray hair: roughly three quarters of chip projects fall behind schedule, only 14 percent of chips in 2024 achieved first-silicon success, and the Wilson Research/Siemens studies consistently rank spec and documentation errors among the leading causes of functional bugs. A six-month slip costs about a third of a product’s lifetime revenue. That is not a tooling problem. That is a truth problem.

Why does this keep happening? The specs were presumably right at some point.

They were. Specs don’t start wrong, they rot.

A design is a living thing. Decisions get made in hallway conversations, in Slack threads, in a late-night ECO. The RTL moves on, and the spec, the user guide, the test plan, and the programmer’s reference manual fall behind at different rates. Each team keeps its own documents locally consistent, but the global picture quietly drifts apart. Nagesh has a quote I like: the question is not whether there will be drift, the question is when you’ll detect it. The cost of the fix is a function of how late that detection happens.

Then you multiply by scale. Even a modest SoC has fifty-plus people split across architecture, design, verification, physical design, board, and embedded software, usually spread across time zones, languages, and company boundaries in the supply chain. And the hardware/software interface is the killer part. Something like 84 percent of ASIC projects now include an embedded processor. That means tens of thousands of registers, and a tiny difference in one definition is all it takes.

Here’s the part I saw up close as the marketing guy who owned the datasheets and integration guides: lead engineers and architects burn up to 40 percent of their time creating and maintaining documents, and the documents still go stale. We were paying our most expensive people to do work that was both hated and, structurally, impossible to keep correct by hand.

Let’s talk about the founders. Who are Nagesh and Mahesh, and why did you believe them?

This is my favorite part of the story, because I’ve sat through a lot of AI pitches, and the difference here was that these two didn’t start with the technology. They started with the scar tissue.

Nagesh Gupta has spent about thirty years across HP, Cadence, Xilinx, and Lattice. He’s a serial entrepreneur with two exits: he founded Taray, which Cadence acquired, and Auviz Systems, which Xilinx acquired. He has lived this problem from the system and customer side his entire career. Mahesh Umasankar is the silicon execution side of the pairing: Intel’s FPGA group, Samsung, VMware-Broadcom, deep in CXL and hardware/software integration. Mahesh is the guy who has been on the receiving end of a bad spec at two in the morning before a tape-out.

So, you have two people who watched the same failure mode for decades from opposite seats. When generative AI matured, they didn’t see a chatbot. They saw the first technology that could hold an entire project’s artifacts consistent with each other, and they decided to build exactly that, and only that.

When I met them, they could finish my war stories before I did. That’s when I believed. There’s also a personal symmetry I enjoy. An old TI friend connected me to two founders who, like me, had spent twenty-plus years quietly furious about the same thing.

So, what does llmda.ai do about it?

The first thing to understand is what llmda is not trying to do. The design and implementation tools are not the problem. Today’s EDA flows are phenomenal at building what you tell them to build. The problem is the connective tissue around them: the specs, documentation, and collateral that tell the tools, the teams, and the customers what “correct” means.

llmda’s first product, llmda Spectra™, is an agentic documentation platform built specifically for semiconductor, hardware, and embedded software teams with the engineer always in the loop. It reads your actual design artifacts, the specs, register maps, RTL, and prior document versions, and drafts accurate technical documentation in minutes instead of weeks, generating 80 percent or more of a document automatically. Just as important, every section is traceable back to its sources, and the platform keeps documents consistent as the design changes underneath them. That last part is the whole ballgame. Anyone can generate a document once. Keeping a family of documents continuously true to a moving design is the hard problem.

And it slots into how teams already work, whatever their input and output formats, and whether or not they’ve adopted AI design tools elsewhere. This is the way I think about it: Spectra hands back the 40 percent of time your best engineers spend on documents, and it converts documentation from a liability you discover late into an asset you can trust.

Couldn’t you just do this with Claude or ChatGPT?

I get this question constantly, and I’ll start by being honest: general-purpose LLMs are remarkable. I use them every day.

But a general-purpose LLM is like a brilliant intern with no memory of your project and no stake in consistency. It will happily write you a beautiful, confident, wrong register description today, and a slightly different beautiful, confident, wrong one tomorrow. For a blog post this might be OK. But for a programmer’s reference manual that a customer’s firmware team will code against, that’s how you manufacture the exact late-stage fire drills we’re trying to kill.

We recently spoke with the AI group at a major semiconductor company that had spent about six months evaluating documentation approaches, including building their own solution on top of generic models, which they priced at a couple million dollars. Their conclusion was blunt: generic LLMs alone are not the answer for engineering documentation.

The difference is the harness. llmda’s value isn’t text generation, it’s everything wrapped around it: models trained on hardware semantics, direct awareness of design artifacts, persistence across the project’s life, human-in-the-loop review gates, and continuous consistency checking across the entire document set. A point solution writes words. A purpose-built platform maintains truth.

Stepping back, why does this matter more than the latest AI design tool announcement?

Because speed without truth just gets you to the wrong answer faster.

Almost all the industry’s AI energy is going into building faster RTL generation, verification copilots, and debug assistants. All of it is real and useful, and I’m glad it exists. But every one of those tools assumes the artifacts feeding it are correct. If the spec is wrong, an AI-accelerated flow doesn’t save you. It accelerates the rework. Given an incorrect specification, today’s tools will build the wrong thing efficiently.

The economics make the priority obvious. Slips are measured in quarters. A six-month slip costs roughly a third of the potential revenue, and a post-tapeout fix runs from millions to nine figures. Trustworthy design artifacts are the foundation under every other AI investment a team makes. That’s why I think this is a bigger lever than any single tool, and it’s why I came out of retirement for it.

Any final thoughts?

Just that I find the whole thing slightly funny. I complained about this problem for twenty years, retired, moved to Spain, and then one message from an old TI friend un-retired me. My wife has opinions about how retirement is going.

If any of this sounds familiar, two things are worth your time. llmda’s webinar on engineering documentation goes live June 16, and the team will be at DAC in Long Beach in July, including a “Build versus Buy” panel that should be a lively one. Everything is at llmda.ai, and you can follow llmda on LinkedIn. If you’ve shipped a chip, these topics should resonate.

Build the right thing, fast and correct. That’s the whole message.

Also Read:

WEBINAR: Engineering Documentation is a Critical Source of Truth – Do You Know if it’s Accurate?

Podcast EP349: llmda.ai’s Unique AI Fabric for Embedded Systems Development with Nagesh Gupta

Learn How llmda Uses Agentic AI to Generate Hardware Docs & Keep Them Consistent


The Memory Sector Is Becoming One of the Main Beneficiaries of the AI Boom

by Daniel Nenni on 06-11-2026 at 10:00 am


The explosive growth of artificial intelligence is transforming the semiconductor industry, and nowhere is this more evident than in the memory sector. AI training and inference workloads are fundamentally memory-intensive, driving unprecedented demand for advanced DRAM architectures, High Bandwidth Memory (HBM), and enterprise NAND storage. While GPUs from NVIDIA dominate headlines, the reality is that AI accelerators cannot function efficiently without massive amounts of high-performance memory tightly integrated into the compute architecture. As a result, memory vendors are emerging as some of the biggest long-term beneficiaries of the AI boom.

At the center of this transformation is HBM, a 3D-stacked DRAM technology that delivers significantly higher bandwidth and lower power consumption than conventional DDR memory. HBM uses through-silicon vias (TSVs) and advanced packaging techniques to vertically stack DRAM dies, enabling memory bandwidth measured in terabytes per second. AI accelerators such as NVIDIA’s H100 and Blackwell platforms depend heavily on HBM3 and HBM3E to feed data into thousands of parallel GPU cores during large language model (LLM) training.

This trend has dramatically altered the competitive dynamics of the memory market. SK hynix has emerged as the dominant supplier of HBM, reportedly securing a leading share of NVIDIA’s HBM3 and HBM3E supply chain. The company’s early investment in TSV technology, advanced packaging, and thermal management gave it a critical advantage as AI demand accelerated. SK hynix is now ramping HBM3E production aggressively and is expected to remain a key supplier for next-generation AI systems.

Samsung Electronics, the world’s largest memory manufacturer, is also investing heavily in HBM capacity and advanced packaging technologies. Samsung’s integrated semiconductor model—including logic, foundry, packaging, and memory—positions the company to compete aggressively in AI infrastructure. Although Samsung initially lagged behind SK hynix in HBM qualification for certain AI platforms, it remains a major long-term player due to its scale, process technology leadership, and ability to rapidly expand production.

Micron Technology has become another major AI beneficiary. Historically viewed as more cyclical and PC-dependent, Micron is now leveraging its advanced DRAM portfolio and HBM roadmap to gain exposure to hyperscale AI deployments. The company’s HBM3E products are being designed into next-generation AI accelerators, and management has repeatedly stated that HBM demand exceeds supply well into future production cycles. Micron’s strong position in enterprise DRAM and data center SSDs also gives it broad leverage to AI infrastructure spending.

AI workloads are increasing memory content per server at an extraordinary rate. Traditional cloud servers typically required several hundred gigabytes of DRAM, but AI servers equipped with multiple GPUs may contain several terabytes of high-bandwidth memory and DDR5 DRAM. A single NVIDIA HGX platform can contain eight GPUs connected with NVLink and supported by enormous pools of HBM. This architecture dramatically increases DRAM consumption per rack and boosts average selling prices for advanced memory products.

DDR5 adoption is also accelerating due to AI server deployments. Compared to DDR4, DDR5 provides higher bandwidth, improved power efficiency, and greater module density, all essential for data center AI workloads. Vendors including Samsung, SK hynix, and Micron are benefiting from the transition as hyperscalers upgrade infrastructure to support generative AI services.

Beyond DRAM, NAND flash suppliers are also positioned to benefit from AI expansion. Generative AI requires massive datasets for model training and inference, driving demand for high-capacity enterprise SSDs. AI data centers rely on fast storage systems to move and manage petabytes of structured and unstructured data. Companies such as Kioxia, Western Digital, Samsung, Micron, and Solidigm are therefore seeing growing demand for enterprise NAND solutions optimized for hyperscale environments.

Another critical technology trend is advanced packaging. AI accelerators increasingly use chiplet architectures and heterogeneous integration, where memory must be tightly coupled to compute dies. This creates opportunities not only for memory vendors but also for packaging leaders such as TSMC, Amkor, and ASE. CoWoS packaging capacity at TSMC has become particularly important because it enables integration of HBM stacks alongside AI GPUs and accelerators.

The AI boom is also reducing some of the historical cyclicality of the memory market. In the past, DRAM and NAND demand depended heavily on smartphones and PCs, leading to severe oversupply cycles. AI infrastructure spending introduces a new structural demand driver tied to hyperscale cloud expansion, enterprise AI adoption, and sovereign AI initiatives. This shift may support stronger long-term pricing and higher capital investment across the memory ecosystem.

Looking ahead, next-generation memory technologies including HBM4, MRAM, CXL-attached memory expansion, and processing-in-memory architectures could further reshape the industry. AI models continue to scale exponentially, requiring ever-larger memory pools and faster interconnects. As compute performance increasingly becomes constrained by memory bandwidth and latency rather than raw processing power, memory vendors are moving from supporting players to strategic enablers of the AI era.

Bottom line: The AI revolution is becoming as much a memory story as a compute story. Companies that can deliver high-bandwidth, low-power, and tightly integrated memory solutions are likely to capture a disproportionate share of semiconductor industry growth over the next decade.

Also Read:

WEBINAR: Engineering Documentation is a Critical Source of Truth – Do You Know if it’s Accurate?

WEBINAR: Caspia’s AI Makes You a Security Verification Expert

What Winemakers and Chip Designers Have in Common


WEBINAR: Engineering Documentation is a Critical Source of Truth – Do You Know if it’s Accurate?

by Mike Gianfagna on 06-11-2026 at 8:00 am


Embedded systems programs rarely fail because of a lack of execution capability. They fail because critical engineering documentation drifts out of alignment over time and distance. That documentation spans requirements, architecture, implementation, verification, hardware bring-up, firmware, and customer documents. Simply put, the team is correctly following the wrong instructions. Local correctness does not guarantee lifecycle coherence.

llmda.ai will soon present an important webinar on this topic. The company delivers the AI fabric for embedded systems development, and the webinar will explore the dimensions of this important problem and how to bring all engineering artifacts back into focus. Below are some details of what the webinar will cover. A registration link is coming as well. As you read through this, you will realize engineering documentation is a critical source of truth – do you know if it’s accurate?

WATCH REPLAY HERE

The Webinar Presenter


The webinar will be presented by Hal Conklin, chief commercial officer for llmda. Hal recently joined the company and brings a rich portfolio of accomplishments. Some of these include VP of sales at Arm, VP of sales and marketing at Carbon Design Systems, CEO and co-founder at BlueSteel Solutions, founder and VP of sales and marketing at CLK Design Automation, and VP of sales at Chrysalis Symbolic Design.

Hal began his career in channel marketing at Cadence. These and other experiences provide Hal with a deep understanding of embedded system design and the challenges that need to be overcome to achieve success.

Some of the Topics Covered

Hal will begin by explaining the critical importance of accurate engineering documentation. Topics covered here will include:

  • Deterministic Correctness: 100% alignment to source implementation details.
  • Source Traceability: Linking every spec to its original engineering artifact.
  • Repeatability: Consistent output across revisions and publishing pipelines.
  • Collaborative Governance: Structured review, approvals, and secure access controls.
  • Structured Outputs: Native support for Word, PDF, DITA, and ReST formats.

He will also discuss some of the challenges presented by general AI approaches. Generic LLMs are good at generating a first draft, but they cannot be relied upon for production quality without significant added effort. He will explore this topic in some detail, describing the risks of generic AI and the benefits of purpose-built solutions such as those offered by llmda.ai.

He will then describe the llmda technology stack and how it provides the foundation for the llmda Agentic Consistency Platform. You will learn a lot about the risks and pitfalls of generic AI approaches. The graphic below summarizes some of the core architectural differentiators and cost realities associated with the llmda approach.

Core architectural differentiators and cost realities

Hal will then provide a live demonstration of llmda Spectra™, the first hardware-grounded agentic AI tool that compiles technically accurate documentation directly from engineering artifacts.

You will see the product in action so you can begin to understand its impact on documentation accuracy and overall system design quality and predictability.

To Learn More

This webinar addresses an item critical to system design success: how to ensure you are building the right system from the start. If complex system design is part of your world, this webinar is a must-see event. It is being presented as a collaboration between llmda.ai and SemiWiki and will be held on June 16, 2026, at 10AM Pacific Time. Engineering documentation is a critical source of truth, and this webinar will help you know if it’s accurate.


Also Read:

Learn How llmda Uses Agentic AI to Generate Hardware Docs & Keep Them Consistent

Why Generic LLMs Fall Short for Critical Engineering Documentation

CEO Interview with Nagesh Gupta of llmda.ai


Technical Paper: FPGA Prototyping That Creates Useful PreSilicon Evidence

by Daniel Nenni on 06-11-2026 at 6:00 am


As semiconductor designs continue to grow in complexity, FPGA prototyping has become an essential component of modern pre-silicon validation strategies. While FPGA capacity and gate-count equivalence often dominate discussions around prototyping platforms, the true value of an FPGA prototype lies elsewhere: its ability to generate useful pre-silicon evidence that reduces tapeout risk and accelerates software and system validation.

A successful FPGA prototype is not simply one that compiles and fits into available FPGA resources. Rather, it is a running, observable, software-connected platform capable of answering critical engineering questions before silicon arrives. Can firmware boot? Can software drivers interact with hardware correctly? Can critical interfaces move realistic traffic? Can failures be observed and root-caused quickly? These questions determine whether a prototype contributes meaningful project value.

One of the primary challenges facing SoC development teams is the misconception that ASIC RTL can be directly migrated into FPGA implementation flows. In reality, ASIC RTL frequently contains structures that are poorly suited for FPGA architectures, including SRAM macros, custom register files, gated clocks, clock multiplexers, scan infrastructure, and technology-specific wrappers. These constructs can create synthesis instability, routing congestion, timing closure difficulties, and reduced operating frequency when left unmodified.

Consequently, FPGA readiness must be addressed early in the development cycle. Memory architectures require careful analysis to ensure that ASIC memory behavior maps appropriately onto FPGA resources such as BRAM, URAM, distributed RAM, or external memory models. Similarly, ASIC clock-gating methodologies must often be transformed into FPGA-friendly clock-enable structures and constrained clocking schemes. Early preparation of RTL significantly improves implementation predictability and reduces downstream engineering effort.

For large SoCs, multi-FPGA partitioning is often unavoidable. However, partitioning should not be viewed as a rescue mechanism for designs that exceed single-device capacity. Instead, it must be considered an architectural activity that balances software requirements, debug visibility, I/O connectivity, and performance goals. Poor partitioning decisions can introduce excessive inter-FPGA latency, bandwidth limitations, timing challenges, and increased debug complexity.

Another critical metric often overlooked in prototyping programs is Time-To-Waveform (TTW). Traditional measurements focus on compile time or achieved clock frequency. TTW, however, measures the time required to move from an RTL modification or debug hypothesis to observable system behavior. In large SoC projects, schedule delays rarely stem from a single failed compile. Instead, they arise from repeated cycles of implementation, bring-up, waveform capture, root-cause analysis, and recompile. Reducing TTW enables engineering teams to identify issues faster, validate fixes more efficiently, and maintain development momentum.

Debug visibility plays an equally important role. A prototype that executes software but lacks adequate observability may offer little practical value. Effective debug strategies require planning before implementation begins. Teams must determine which buses, interfaces, state machines, clock domains, and software-visible registers remain observable throughout the bring-up process. Preserving visibility minimizes disruptive recompiles and accelerates root-cause analysis.

Beyond hardware execution, modern FPGA prototypes must function as complete system platforms. Firmware teams require access to registers and memory subsystems. Driver developers need transaction-level interfaces and traffic generation capabilities. Validation engineers require realistic I/O environments. Host-to-DUT connectivity, reusable interface IP, memory models, and standardized runtime control mechanisms transform prototypes from isolated hardware platforms into productive software-development environments.

The Full Technical Paper is Available Here.

Bottom line: The most effective FPGA prototyping strategy focuses on evidence generation rather than capacity metrics. By combining prepared RTL, disciplined implementation flows, robust debug visibility, host connectivity, reusable infrastructure, and system-level validation capabilities, engineering teams can create prototypes that answer project-critical questions early enough to influence outcomes. In an era of increasingly complex SoCs and chiplet-based architectures, useful pre-silicon evidence—not FPGA capacity alone—has become the defining measure of prototyping success.

Also Read:

The “New Shift-Left”: Why FPGA Prototyping is the Ultimate RISC-V IP Sandbox

2026 Outlook with Ying J Chen of S2C

Accelerating Advanced FPGA-Based SoC Prototyping With S2C


Accelerating AI with RISC-V ISA

by Daniel Nenni on 06-10-2026 at 2:00 pm


Lessons from Hands-On DGEMM Benchmarking

Using cycle-accurate simulation to explore how RISC-V vector extensions accelerate one of computing’s most important workloads

1. Why Vector Performance Matters

While GPUs dominate large-scale model training, CPUs execute a vast amount of matrix math in inference pipelines, data preprocessing, and scientific computing. Basic Linear Algebra Subprograms (BLAS) libraries underpin many scientific and machine learning frameworks, and one of their most important routines is GEMM — General Matrix Multiply. DGEMM (double-precision GEMM) is the gold-standard benchmark for this class of workload: it simultaneously stresses floating-point throughput, vector execution, register utilization, and memory bandwidth, making it a highly representative proxy for real-world compute intensity.

In this article we share practical lessons from hands-on experiments implementing a DGEMM kernel on an Andes AX46MPV near cycle-accurate RISC-V simulator. A cycle-accurate simulator lets us observe performance at the granularity of individual processor cycles and instrument the code with hardware performance monitoring (HPM) counters to track cycle counts, instruction throughput, and memory activity — all without silicon.

Our experiments confirm that the RISC-V Vector Extension (RVV) can dramatically improve performance with modest coding effort. A minimal RVV port delivered a 12× speedup over a scalar baseline; progressive tuning pushed that to over 150×; and enabling the High Bandwidth Vector Memory (HVM) feature unlocked 275×, reaching 92.8% of theoretical peak efficiency. These gains stem from RVV’s scalable design, and the lessons learned along the way reveal how vector parameters interact in ways that are not always intuitive.

2. RISC-V Vector Extension (RVV) Primer

Unlike fixed-width vector ISAs such as x86 AVX-512 or Arm NEON, RVV uses a scalable vector model: the hardware determines the physical vector register width and software adapts at runtime. While Arm SVE is also a scalable architecture, RVV is more flexible in both scalability and software-controlled tunability. The same binary runs efficiently across a wide range of RISC-V hardware without recompilation.

Three parameters govern how vector operations behave:

  • VLEN — the physical width of a vector register in bits (e.g. 512 or 1024). A 1024-bit register holds 16 double-precision (FP64) elements.
  • SEW (Selected Element Width) — the element size used for each operation, chosen at runtime. For DGEMM, we use SEW=64 (FP64).
  • LMUL (Length MULtiplier) — a register-grouping multiplier. LMUL=4 fuses four physical registers into one logical vector, quadrupling capacity but proportionally reducing the number of independent logical registers available. With 32 physical registers, LMUL=4 yields 8 logical register groups.

Experimenting with these “knobs” taught us that increasing vector capacity always involves trade-offs. The interactions between VLEN, LMUL, and loop structure can produce counter-intuitive results, as the experiments below illustrate.
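
The relationship between the three parameters can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not code from the experiments; the function names are hypothetical, and it covers only integer LMUL values (fractional LMUL settings are ignored).

```python
# Sketch of the VLEN / SEW / LMUL arithmetic described above.
# Function names are illustrative; only integer LMUL is modeled.

PHYSICAL_VECTOR_REGISTERS = 32  # register count quoted in the text


def elements_per_register(vlen_bits: int, sew_bits: int) -> int:
    """Elements that fit in one physical vector register."""
    return vlen_bits // sew_bits


def logical_groups(lmul: int) -> int:
    """Independent logical register groups left at a given LMUL."""
    return PHYSICAL_VECTOR_REGISTERS // lmul


def elements_per_instruction(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Elements one vector instruction can process at a given LMUL."""
    return elements_per_register(vlen_bits, sew_bits) * lmul


if __name__ == "__main__":
    for lmul in (1, 2, 4, 8):
        print(
            f"LMUL={lmul}: "
            f"{elements_per_instruction(1024, 64, lmul)} FP64 elements per instruction, "
            f"{logical_groups(lmul)} logical register groups"
        )
```

Each doubling of LMUL doubles the work per instruction and halves the number of register groups, which is the trade-off the later lessons turn on.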

What is HVM?

High Bandwidth Vector Memory (HVM) is a microarchitectural feature that addresses a specific bottleneck: the data path between the cache and the vector register file. In a standard RVV implementation, vector loads travel through a 512-bit wide cache data bus. At VLEN=1024 with SEW=64, each vector register holds 16 FP64 elements (1024 bits), so filling it requires two sequential 512-bit transfers. HVM provides a dedicated 1024-bit wide memory path to the vector register file, wide enough to deliver a full 1024-bit vector register in a single transfer.

HVM is transparent to software: no changes to the RVV binary are required. The same kernel benefits automatically when HVM is enabled. This makes it an especially attractive feature for memory-bandwidth-bound workloads like DGEMM, where vector loads are on the critical path.
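
The transfer count described above is a ceiling division of register width by path width. Here is a minimal sketch, assuming one physical register per load (LMUL=1) and ignoring cache misses and alignment; the function name is hypothetical.

```python
# Sketch: sequential transfers needed to fill one vector register.
# Assumes a single physical register per load; latency effects are not modeled.


def transfers_per_register(vlen_bits: int, path_bits: int) -> int:
    """Ceiling of vlen_bits / path_bits using integer arithmetic."""
    return -(-vlen_bits // path_bits)


if __name__ == "__main__":
    print("cache path, VLEN=1024:", transfers_per_register(1024, 512))
    print("HVM path,   VLEN=1024:", transfers_per_register(1024, 1024))
```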

3. DGEMM: What We Are Measuring

DGEMM computes C = αAB + βC where A is M×K, B is K×N, and C is M×N. The computational cost is 2×M×N×K FLOPs (one multiply and one add per element pair). For our 64×64 test matrices that is approximately 524K floating-point operations per kernel call.

Theoretical peak throughput for the AX46MPV at VLEN=1024, SEW=64 is 64 FLOPs/cycle. This reflects the processor’s dual VFMACC capability: two VFMACC operations issued per cycle, each operating on 16 FP64 elements (2 FLOPs each), giving 2 × 16 × 2 = 64 FLOPs/cycle.
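
Both figures follow directly from the formulas in the text. This short sketch reproduces them; the function names and the `fma_issue_per_cycle` parameter are illustrative labels for the quantities quoted above.

```python
# Sketch: DGEMM operation count and theoretical peak, using the formulas in the text.


def dgemm_flops(m: int, n: int, k: int) -> int:
    """One multiply and one add per (i, j, p) triple."""
    return 2 * m * n * k


def peak_flops_per_cycle(vlen_bits: int, sew_bits: int = 64,
                         fma_issue_per_cycle: int = 2) -> int:
    """Issued FMAs per cycle x elements per register x 2 FLOPs per FMA."""
    return fma_issue_per_cycle * (vlen_bits // sew_bits) * 2


if __name__ == "__main__":
    flops = dgemm_flops(64, 64, 64)
    peak = peak_flops_per_cycle(1024)
    print("FLOPs per 64x64x64 call:", flops)
    print("Peak FLOPs/cycle at VLEN=1024:", peak)
    print("Cycle count at 100% of peak:", flops // peak)
```

The last line gives the floor that any measured cycle count for this kernel size can be compared against.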

We instrumented the kernel with HPM counters to capture cycle count, instruction count, and memory activity. The scalar baseline was compared with a naïve RVV version and progressively tuned implementations, systematically varying VLEN, LMUL, and loop-blocking structure.

4. Scalar vs. RVV: The Impact of Vectorization

We implemented four progressively optimized versions of DGEMM. The results are summarized in Table 1.

Table 1. Performance summary across implementation stages. Peak = 64 FLOPs/cycle (VLEN=1024, SEW=64, AX46MPV).

Even the most naïve RVV port — a straightforward translation of the scalar loop using three RVV primitives — delivered a 12× speedup with minimal effort. The inner loop relies on just three instructions:

  • vsetvl — sets the active vector length for the hardware; the same code runs on any VLEN-capable core
  • vle64 — loads a vector of FP64 values from memory into a register group
  • vfmacc — fused multiply-accumulate across a full vector in a single instruction
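
To show how these three primitives combine into a strip-mined loop, here is a pure-Python model of the control flow. It is not RVV code and not the kernel used in the experiments: the `model_*` helpers are hypothetical stand-ins for the instructions, `model_vsetvl` simplifies the real rule to `min(remaining, vlmax)`, and the sketch computes C += A·B (that is, α = β = 1).

```python
# Pure-Python model of a naive strip-mined DGEMM inner loop.
# The model_* helpers are illustrative stand-ins, not real intrinsics.


def model_vsetvl(remaining: int, vlmax: int) -> int:
    """Simplified vector-length selection: as many elements as fit."""
    return min(remaining, vlmax)


def model_vle64(row, start: int, vl: int):
    """Load vl consecutive FP64 values starting at index start."""
    return row[start:start + vl]


def model_vfmacc(acc, scalar: float, vec):
    """acc[i] += scalar * vec[i] for every active element."""
    return [c + scalar * b for c, b in zip(acc, vec)]


def dgemm_rvv_model(a, b, c, vlmax: int = 16):
    """C += A * B, processing each row of C in strips of up to vlmax columns."""
    m, k, n = len(a), len(b), len(b[0])
    for i in range(m):
        j = 0
        while j < n:
            vl = model_vsetvl(n - j, vlmax)
            acc = model_vle64(c[i], j, vl)
            for p in range(k):
                acc = model_vfmacc(acc, a[i][p], model_vle64(b[p], j, vl))
            c[i][j:j + vl] = acc  # models the vector store back to C
            j += vl
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]
    c = [[0.0] * 3 for _ in range(2)]
    print(dgemm_rvv_model(a, b, c, vlmax=2))
```

Because the strip length comes from `model_vsetvl` rather than a compile-time constant, the same loop handles a final partial strip and any value of `vlmax` without a separate tail loop.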

Subsequent tuning — loop blocking and multiple accumulators — delivered a further 13× on top of the naïve RVV result, reaching 152× over scalar. Enabling HVM then added another 1.8× to reach the peak result of 275×. The key insight is that these gains are cumulative and largely independent: vectorize first, tune the register structure second, then exploit microarchitectural features like HVM third.

5. Counter-Intuitive Performance Lessons

Systematic parameter sweeps revealed three lessons that initially seem counterintuitive.

Lesson 1: Vectorize First

Do not assume that a modern optimizing compiler targeting a vector-capable processor will auto-vectorize hot loops. Despite compiling with -O3 and a vector-capable target, the compiler produced a scalar binary indistinguishable in cycle count from a build with auto-vectorization explicitly disabled.

Writing even a naïve RVV kernel using the three primitives described above immediately yielded 12×. The lesson is that even a straightforward manual RVV port is transformative, and becomes the foundation for all further optimization.

Lesson 2: The Register Budget Cliff

Increasing LMUL raises the number of elements processed per instruction, which sounds unambiguously good. However, LMUL also consumes physical vector registers. With 32 physical registers, LMUL=4 provides 8 logical register groups; LMUL=8 provides only 4. A kernel that maintains multiple independent accumulators — essential for hiding FMA pipeline latency — requires a budget of registers for both the accumulators and the live data vectors.

When that budget is exceeded, the compiler must spill registers to memory and reload them, replacing fast FMA throughput with expensive load/store traffic. Table 2 shows the cliff in practice.

Table 2. Register spill cliff: LMUL=8 collapses performance regardless of whether HVM is enabled.

Two observations stand out. First, the performance collapse at LMUL=8 is severe — 7.8× slower without HVM and 18.5× slower with HVM — making this one of the most consequential single-parameter choices in the sweep. Second, and importantly, HVM does not rescue LMUL=8. HVM widens the memory bandwidth path to the vector register file; it does not add physical registers. The register budget constraint is a fundamental microarchitectural limit, not a software artifact.

The practical rule: for a 64×64 DGEMM kernel, LMUL=4 with 4–6 accumulator rows is the sweet spot. LMUL=4 provides enough logical registers (8 groups) to sustain high accumulator parallelism while keeping all live vectors within the 32-register budget.
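The register accounting behind this rule can be sketched as simple arithmetic. The 32-register file and the LMUL grouping come from the text; the `load_groups=1` default (one register group reserved for live data vectors) is an assumption made for illustration, and the real kernel may need more.

```python
PHYSICAL_VREGS = 32  # vector registers available, per the text

def register_groups(lmul):
    """Logical register groups available at a given LMUL."""
    return PHYSICAL_VREGS // lmul

def elements_per_group(vlen_bits, lmul, sew_bits=64):
    """FP64 elements held by one register group."""
    return vlen_bits * lmul // sew_bits

def spare_groups(lmul, accumulator_rows, load_groups=1):
    """Groups left after accumulators and live data vectors.

    A negative result means the kernel does not fit and must spill.
    load_groups=1 is an illustrative assumption, not a measured value.
    """
    return register_groups(lmul) - accumulator_rows - load_groups

for lmul in (4, 8):
    for rows in range(2, 8):
        spare = spare_groups(lmul, rows)
        status = "fits" if spare >= 0 else "spills"
        print(f"LMUL={lmul} rows={rows}: spare groups={spare} ({status})")
```

At VLEN=1024 and LMUL=4, `elements_per_group` returns 64, so one register group holds a full row of the 64×64 problem.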

Lesson 3: HVM and the Importance of Matching VLEN to the Data Bus

HVM provides a substantial further gain when VLEN=1024 and the register budget is managed correctly. Through the cache, only 512 bits of data can be transferred per vector load; with HVM, each vector load can transfer 1024 bits per cycle. Table 3 illustrates that without HVM, increasing VLEN from 512 to 1024 cuts the cycle count by about a third (23,892 to 15,979 cycles, roughly a 1.5× speedup), whereas with HVM the performance doubles (17,812 to 8,831 cycles).

Table 3 also illustrates why “efficiency relative to theoretical peak” requires careful interpretation. The VLEN=512 no-HVM configuration shows 68.6% efficiency — higher than the VLEN=1024 no-HVM result of 51.3% — yet the VLEN=1024 configuration is faster in absolute terms (15,979 vs. 23,892 cycles). The theoretical peak also doubles with the vector length (VLEN), so a configuration that gains less than 2× when doubling VLEN will show a drop in efficiency percentage even while improving absolute throughput.

Table 3. HVM delivers its full benefit only when VLEN matches the 1024-bit HVM data bus.
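The relationship between speedup and efficiency can be checked directly from the cycle counts quoted above. The sketch below assumes only what the text states: theoretical peak doubles when VLEN doubles, so efficiency scales by the achieved speedup divided by two. The function names are illustrative.

```python
def speedup(cycles_before, cycles_after):
    """Ratio of cycle counts; greater than 1 means faster."""
    return cycles_before / cycles_after

def efficiency_after_doubling(eff_before_pct, achieved_speedup):
    """Efficiency after doubling VLEN, given that peak also doubles."""
    return eff_before_pct * achieved_speedup / 2.0

# Cycle counts for VLEN=512 -> VLEN=1024, taken from the text.
no_hvm = speedup(23_892, 15_979)
with_hvm = speedup(17_812, 8_831)

# Starting from the reported 68.6% at VLEN=512 without HVM.
eff_1024_no_hvm = efficiency_after_doubling(68.6, no_hvm)

print(f"no HVM speedup:   {no_hvm:.2f}x")
print(f"with HVM speedup: {with_hvm:.2f}x")
print(f"predicted VLEN=1024 no-HVM efficiency: {eff_1024_no_hvm:.1f}%")
```

The predicted efficiency matches the 51.3% figure reported for VLEN=1024 without HVM, which confirms that the drop in percentage comes from the ceiling doubling rather than from any loss of absolute performance.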

Finding the Blocking Sweet Spot

With HVM enabled at VLEN=1024, LMUL=4, we swept from 2 to 7 rows to find the accumulator count that best hides FMA latency within the available register budget. Table 4 shows a clean progression with a peak at 6-row blocking.

Table 4. HVM blocking sweep (VLEN=1024, LMUL=4). Performance peaks at 6-row blocking and slightly reverses at 7-row.

The slight reversal at 7-row reflects the same register pressure dynamic seen in Lesson 2: adding a seventh accumulator row begins to crowd out the load registers, introducing minor spill overhead. The 6-row optimum represents the point at which the kernel fully hides FMA pipeline latency without exceeding the register budget.

What is row-blocking?

Row-blocking (also called accumulator unrolling or kernel unrolling in GEMM literature) is a specific application of loop unrolling applied to the output rows of a matrix multiply kernel. In a naïve implementation, the inner loop computes one row of the output matrix at a time, loading a single accumulator register and issuing one FMA per iteration. Row-blocking instead computes N output rows simultaneously within the same inner loop body, holding N independent accumulator vectors live across the loop.
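The difference between the two loop structures can be sketched in plain Python, with each list standing in for a vector register group. This is a minimal model of the technique under that assumption, not the kernel used in the study; the function names and the small test matrices are illustrative.

```python
def matmul_naive(a, b):
    """One output row at a time: a single accumulator vector is live."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        acc = [0.0] * m                  # the only accumulator
        for p in range(k):
            a_ip = a[i][p]
            b_row = b[p]                 # models a vector load of one B row
            acc = [x + a_ip * y for x, y in zip(acc, b_row)]
        c[i] = acc
    return c

def matmul_row_blocked(a, b, rows=4):
    """Compute `rows` output rows together with independent accumulators."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, rows):
        block = range(i0, min(i0 + rows, n))   # last block may be shorter
        accs = [[0.0] * m for _ in block]      # independent accumulators
        for p in range(k):
            b_row = b[p]                       # loaded once, reused per row
            for r, i in enumerate(block):
                a_ip = a[i][p]
                accs[r] = [x + a_ip * y for x, y in zip(accs[r], b_row)]
        for r, i in enumerate(block):
            c[i] = accs[r]
    return c

a = [[float(i + j) for j in range(3)] for i in range(5)]
b = [[float(i * j + 1) for j in range(4)] for i in range(3)]
same = matmul_naive(a, b) == matmul_row_blocked(a, b, rows=2)
print(same)
```

In the blocked version, the updates to the different accumulators within one `p` iteration do not depend on each other, which is what lets real hardware overlap them in the FMA pipeline. Each loaded row of B is also reused across all rows in the block.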

Row-blocking specifically targets accumulator independence to keep enough independent work in flight to fully pipeline the FMA units. The optimal blocking factor is a function of FMA latency, issue width, and available register budget.

6. Conclusions

Our experiments confirm that RVV vectorization is transformative for compute-intensive workloads, and that moving from scalar code to near-peak efficiency is achievable with modest effort. The practical takeaways are:

  • Vectorize explicitly and early. The compiler will not auto-vectorize complex kernels like DGEMM even with -O3 and a vector-capable target. A naïve RVV port using vsetvl, vle64, and vfmacc immediately delivers 12× over scalar. Subsequent tuning then compounds that gain.
  • Be aware of the 32-register budget. LMUL and accumulator count jointly consume physical vector registers. Exceeding the budget triggers spills that can slow execution by 8–18×, a larger penalty than most other single-parameter mistakes. Stay within budget: LMUL=4 with 4–6 accumulator rows is typically the safe operating region.
  • Consider features like HVM to achieve higher memory bandwidth. Wider vectors are only faster if the memory path can sustain the bandwidth. On the AX46MPV, HVM provides a dedicated 1024-bit data path to match VLEN=1024. At VLEN=512, the standard bus already handles a vector register in a single transfer.
  • Track absolute throughput alongside efficiency ratios. A drop in percentage efficiency when increasing VLEN does not mean performance got worse — it may mean the theoretical ceiling scaled faster than you could follow.
  • Cycle-accurate simulation with HPM counters is a powerful development tool. All of the above was characterized without silicon, enabling rapid iteration over a large parameter space before tape-out.

RISC-V’s scalable vector model, combined with microarchitectural features like HVM, delivers a compelling path to near-peak floating-point efficiency on demanding workloads. The same binary runs across the full range of VLEN-capable cores; tuning effort is focused on a small number of well-understood parameters; and the gains compound in a predictable way. For teams targeting AI inference, HPC, or any compute-intensive workload on RISC-V, the message is clear: explicit RVV programming is both necessary and highly rewarding.

About the AX46MPV

The Andes AX46MPV is a high-performance RISC-V application processor implementing the RVV 1.0 vector extension with configurable VLEN (up to 1024 bits). The near cycle-accurate simulator used in this study models the processor’s pipeline, vector unit, cache hierarchy, and HPM counters with sufficient fidelity for architectural performance analysis.

Also Read: