Revolutionizing Hardware Design Debugging with Time Travel Technology
by Daniel Nenni on 01-02-2026 at 6:00 am

DVCon Europe 2025 Undo.io

In the semiconductor industry, High-Level Synthesis (HLS) and SystemC have become essential tools, allowing engineers to model complex hardware designs using familiar C/C++ constructs. Yet despite the widespread adoption of these languages, debugging workflows in hardware development lag far behind those in software engineering. Traditional methods rely heavily on print statements, logs, waveform viewers, and iterative trial-and-error, often leading to frustration when bugs appear intermittently or in third-party libraries. This is where time travel debugging changes everything.

Time travel debugging, as pioneered by tools like Undo, introduces a powerful paradigm: record, replay, and resolve. Instead of repeatedly rerunning a failing simulation in hopes of reproducing a bug, engineers record the entire execution of a Linux process from the process level down to individual CPU instructions. This recording captures every deterministic and nondeterministic event, including system calls, I/O, timing functions, and multithreaded interactions. Once a crash or failure occurs, the tool automatically stops recording, preserving the exact state at the point of failure.

The magic happens during replay. Engineers load the portable recording into the debugger and navigate freely forwards and backwards in time. If a crash is at the end of the recording, simply jump there and step backward from symptom to root cause. Traditional forward-only debuggers like GDB force users to restart runs repeatedly, but time travel eliminates guesswork. Commands mirror GDB’s familiar syntax with reverse counterparts: reverse-step, reverse-next, reverse-finish, and reverse-continue. A particularly powerful feature is “last,” which instantly jumps to the exact moment a variable or memory location was last modified—ideal for tracking memory corruption or race conditions.

In a live demonstration involving a SystemC testbench with multiple libraries, a subtle off-by-one error caused a failure: code intended to read the bit at index zero of a string instead accessed the bit at index one, yielding garbage output. Using the recording, an AI assistant (Claude) interfaced with Undo via a custom library, autonomously navigated backward, set bookmarks, executed reverse commands, and pinpointed the exact faulty array access in minutes—without any manual intervention.

This approach shines in complex scenarios common to hardware modeling:
  • Race conditions: Multithreaded SystemC simulations often exhibit nondeterministic behavior. The “last” command, combined with reverse-continue, reveals which thread overwrote shared memory and when, exposing missing locks without recompilation.
  • Deadlocks: Recordings capture all thread states, allowing engineers to trace blocking calls across time.
  • Intermittent failures: By integrating recording into regression pipelines, failing tests automatically generate recordings only when assertions fail, ensuring reproducible evidence is ready the next morning.

Undo also addresses hardware engineers’ needs with a waveform viewer that generates standard waveforms from recordings. Clicking any signal jumps directly to the corresponding source code line in the debugger, bridging the gap between high-level C++ models and low-level signal behavior.

Performance overhead is minimal for computational code, often near full speed, though I/O-heavy or highly nondeterministic workloads incur some slowdown due to logging external inputs. The tool requires Linux with modern Intel/AMD processors but needs no code changes, debug builds, or instrumentation.

Compared to alternatives like the open-source rr project (great for academic use) or Microsoft’s Time Travel Debugging in Visual Studio, Undo offers production-grade reliability, multithreading support, and seamless integration with modern EDA workflows.

Engineers report that debugging complex SystemC models traditionally takes at least a day, often involving consultations with library vendors or code owners. Time travel debugging reduces this by 4x or more, democratizing debugging: junior engineers can trace issues in unfamiliar codebases by simply following data flow backward. This accelerates verification, improves coverage, shortens time-to-market, and preserves team sanity.

Bottom line: In an industry racing toward ever-more-complex designs, adopting time travel debugging isn't just an upgrade; it's a necessity. Tools like Undo bring software's most powerful debugging techniques to hardware, empowering engineers to resolve bugs faster, more reliably, and with less frustration.

Contact Undo

Also Read:

Taming Concurrency: A New Era of Debugging Multithreaded Code

Video EP7: The impact of Undo’s Time Travel Debugging with Greg Law

CEO Interview with Dr Greg Law of Undo


Addressing Silent Data Corruption (SDC) with In-System Embedded Deterministic Testing
by Daniel Nenni on 01-01-2026 at 10:00 am

Siemens Broadcom TSMC OIP2025 SemiWiki

Silent Data Corruption (SDC) represents a critical challenge in modern semiconductor design, particularly in high-performance computing environments like AI data centers. As highlighted in a collaborative presentation by Broadcom Inc. and Siemens EDA at the 2025 TSMC OIP event, SDC occurs when hardware defects cause erroneous computations without triggering detectable errors, leading to subtle yet devastating failures. In one customer experiment involving a 54-day training run on 16,384 GPUs, 419 unexpected interruptions were reported, with 6 attributed directly to SDC. Though rare, accounting for about 1.4% of fails, these incidents can disrupt mission-critical operations, such as AI model training, where reliability is paramount.

The presentation underscores the industry-wide nature of SDC, driven by shrinking process nodes and increasing chip complexity. Defects that evade manufacturing tests may manifest in-field due to aging, voltage fluctuations, or thermal stress. Traditional testing methods fall short here, as they require device removal for diagnostics, which is impractical in deployed systems. To combat this, the teams advocate for in-system testing capabilities that allow periodic checks without downtime. Running ATPG patterns directly in the field detects latent defects that could precipitate SDC, ensuring system integrity. For AI applications, this means integrating test suites that can be executed routinely, preventing costly interruptions. Moreover, new patterns tailored to SDC can be deployed remotely, extending device lifespan without physical intervention.

Siemens’ In-System Test (IST) solution emerges as a key enabler. Built on the Streaming Scan Network (SSN), IST interfaces with embedded deterministic test (EDT) structures to deliver ATPG patterns efficiently. The IST controller drives the SSN’s parallel interface, supporting high-bandwidth data transfer via protocols like APB or AXI. In Broadcom’s implementation, IST was adapted for an EDT-based design with a Streaming Scan Host at the chip level. The controller resides at the top level, loading patterns into local SRAM via an on-chip CPU. Block-level EDT patterns, originally for production testing, are retargeted to IST inputs, allowing selective testing of targeted blocks while maintaining functional operation elsewhere.

Implementation brought several design challenges to the fore. Functional isolation is paramount: “functional” blocks (e.g., CPU subsystems) must remain active to load and execute IST operations, while “targeted” blocks switch to scan mode for testing. This requires isolating scan inputs to prevent interference. All functional block inputs that could disrupt IST, such as interrupts or AXI signals, must be held in a “quiet” state. Outputs from targeted blocks, which toggle during capture, are gated to avoid propagating noise. Broadcom addressed this by inserting isolation blocks and enabling Test Data Registers for control.

Clock splitting posed another hurdle. Broadcom’s methodology places On-Chip Clock controllers (OCC) at the chip top due to custom clocking. Functional blocks need free-running clocks, but targeted ones require OCC activation for scan shifts. Solutions included branching pre-OCC clocks for functional paths or adding secondary OCCs for targeted branches, ensuring synchronized yet independent clock domains.

Verification and Static Timing Analysis added complexity. Typically, STA modes separate functional and Design-for-Test (DFT) paths, but IST demands a hybrid “merged” mode where some blocks are functional and others in DFT. The Siemens tool provides verification collaterals like transaction files, C code, and SystemVerilog tasks for Design Verification (DV) environments. Testing occurs on post-DFT netlists, incorporating boot sequences, which extends runtime. Close collaboration between DV and DFT teams was essential for deliverables and debugging handshakes.

Results from the APB-based IST implementation demonstrate feasibility. With a 32-bit wide subordinate interface and SSN data bus, hardware overhead was modest: the IST Controller (ISTC) added 200 flops and 5,000 normalized combinational logic units, while SSH contributed 1,000 flops and 30,000 units. Five intest modes were run for 2,500 patterns, using 2 MB on-chip SRAM (about 0.5 million 32-bit words). Pattern storage ranged from 165,000 to 260,000 words per mode, with counts of 22-35 patterns. Overall, ~1.9 million 32-bit words were managed, with 4 loads per mode, showcasing efficient compression and bandwidth utilization.

Bottom line: The collaboration between Broadcom and Siemens highlights IST’s role in mitigating SDC through in-field testing. Despite challenges in isolation, clocking, and verification, the solution was successfully implemented and verified in DFT and DV setups. Future efforts will extend to AXI-based IST, promising broader adoption. This approach not only enhances reliability in AI and hyperscale environments but also reduces field failures, underscoring the value of embedded deterministic testing in next-generation silicon.

Also Read:

Podcast EP323: How to Address the Challenges of 3DIC Design with John Ferguson

3D ESD verification: Tackling new challenges in advanced IC design

Signal Integrity Verification Using SPICE and IBIS-AMI


TSMC’s 6th ESG AWARD Receives over 5,800 Proposals, Igniting Sustainability Passion
by Daniel Nenni on 01-01-2026 at 6:00 am

TSMC ESG Award Ceremony 2025

Taiwan Semiconductor Manufacturing Company has once again demonstrated its leadership in corporate sustainability with the successful conclusion of its 6th ESG AWARD, which attracted more than 5,800 proposals from employees across the organization. The overwhelming response reflects not only TSMC’s strong internal engagement but also the growing momentum of environmental, social, and governance (ESG) values within the global semiconductor industry.

Launched as a platform to encourage employee participation in sustainable innovation, the ESG AWARD has become one of TSMC’s most influential internal initiatives. The sixth edition recorded a significant increase in submissions compared to previous years, highlighting how sustainability has evolved from a corporate objective into a shared mission embraced by employees at all levels. Proposals covered a wide range of topics, including energy efficiency, carbon reduction, water resource management, waste minimization, supply chain responsibility, workplace well-being, and community engagement.

TSMC emphasized that the award is not merely a competition, but a catalyst for turning ideas into action. Many past award-winning proposals have been successfully implemented across fabs and offices, delivering measurable environmental and social benefits. These include innovations in energy-saving manufacturing processes, circular economy practices for materials reuse, and digital solutions to enhance operational transparency and governance. By empowering employees to contribute ideas directly linked to real-world impact, TSMC reinforces a culture where sustainability is embedded into daily operations.

The strong participation in the 6th ESG AWARD also reflects the broader pressures and responsibilities facing semiconductor manufacturers today. As demand for advanced chips grows alongside global digital transformation, the industry’s environmental footprint has come under increasing scrutiny. High energy consumption, water usage, and complex supply chains pose challenges that require both technological innovation and organizational commitment. TSMC’s approach demonstrates how internal engagement can play a crucial role in addressing these challenges proactively.

According to TSMC, proposals submitted this year showed greater maturity and cross-functional collaboration than in previous editions. Many teams combined technical expertise with ESG thinking, proposing solutions that balance productivity, cost efficiency, and sustainability. This shift suggests that ESG considerations are no longer treated as separate from core business goals, but rather as integral to long-term competitiveness and resilience.

The award process includes rigorous evaluation criteria, focusing on innovation, feasibility, scalability, and alignment with TSMC's sustainability strategy. Selected proposals receive recognition and resources to support further development and implementation. This mechanism not only motivates employees but also accelerates the company's progress toward its ESG targets, including net-zero ambitions and responsible supply chain management.

Beyond internal impact, the ESG AWARD sends a strong signal to stakeholders, including customers, investors, and partners. It highlights TSMC’s commitment to transparency, accountability, and continuous improvement in ESG performance. In an era where ESG metrics increasingly influence investment decisions and customer trust, such initiatives strengthen TSMC’s reputation as a responsible industry leader.

The enthusiasm generated by the 6th ESG AWARD underscores a key lesson for global corporations: sustainability thrives when employees are empowered to participate meaningfully.

Bottom Line: By transforming ESG from a top-down directive into a bottom-up movement, TSMC has ignited a passion that extends beyond awards and recognition. As the company looks ahead, the ideas and energy unleashed by this year’s record-breaking participation are expected to play a vital role in shaping a more sustainable future for both TSMC and the semiconductor industry as a whole.

Also Read:

TSMC based 3D Chips: Socionext Achieves Two Successful Tape-Outs in Just Seven Months!

Why TSMC is Known as the Trusted Foundry

TSMC’s Customized Technical Documentation Platform Enhances Customer Experience


Tiling Support in SiFive’s AI/ML Software Stack for RISC-V Vector-Matrix Extension
by Daniel Nenni on 12-31-2025 at 10:00 am

SiFive AI ML RISC V Summit 2025

At the 2025 RISC-V Summit North America, Min Hsu, Staff Compiler Engineer at SiFive, presented on enhancing tiling support within SiFive’s AI/ML software stack for the RISC-V Vector-Matrix Extension (VME). This extension aims to boost matrix multiplication efficiency, a cornerstone of AI workloads. SiFive’s VME implementation introduces a large matrix accumulator state for the result matrix C, leveraging existing RISC-V Vector (RVV) registers to supply source operands A and B. This design enables outer-product-style multiplications directly into the C accumulator, with options for “fat” k>1 support to handle narrower input datatypes. Rows or columns of C can be moved to vector registers or loaded/stored from memory, and the C state may be segmented into multiple tiles. By positioning the accumulator near arithmetic units, the matrix engine achieves high throughput, making it ideal for compute-intensive AI tasks.
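As a rough mental model (a sketch, not SiFive's implementation), the accumulate-into-C behavior amounts to repeated rank-1 updates of the accumulator from vector-register operands, which NumPy expresses compactly:

```python
import numpy as np

# Hypothetical reference model of one outer-product accumulate step:
# a_col is a column of A held in vector registers, b_row is a row of B,
# and C is the matrix accumulator state kept near the arithmetic units.
def vme_outer_product_step(C, a_col, b_row):
    # C (m x n) += a_col (m) outer b_row (n)
    C += np.outer(a_col, b_row)
    return C

# Accumulating over the k dimension reproduces a full matmul.
m, n, k = 8, 8, 16
A = np.random.rand(m, k).astype(np.float32)
B = np.random.rand(k, n).astype(np.float32)
C = np.zeros((m, n), dtype=np.float32)
for kk in range(k):
    vme_outer_product_step(C, A[:, kk], B[kk, :])
assert np.allclose(C, A @ B, atol=1e-4)
```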

A key focus was tiled matrix multiplication, illustrated through a Python pseudocode example. The function tiled_matmul decomposes large matrices A (m x k), B (k x n), and C (m x n) into manageable tiles. Outer loops iterate over tile_m, tile_n, and tile_k dimensions, creating views of sub-matrices (e.g., lhs_tile = A[m1:m1+tile_m, k1:k1+tile_k]). Inner loops then apply register-level tiling with tile_m_v, tile_n_v, and tile_k_v, performing the core operation: dst_tile[mv:mv+tile_m_v, nv:nv+tile_n_v] += np.matmul(lhs_tile_v, rhs_tile_v). This hierarchical tiling optimizes data locality—outer tiles fit into caches, inner ones into registers—reducing memory access overhead and enhancing performance for large-scale AI models.
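For readers who want to see the structure end to end, here is a runnable reconstruction of the described pseudocode (tile sizes are illustrative, not SiFive's recommended values):

```python
import numpy as np

def tiled_matmul(A, B, C, tile_m=32, tile_n=32, tile_k=32,
                 tile_m_v=8, tile_n_v=8, tile_k_v=8):
    """Hierarchical tiling sketch: outer tiles target the caches,
    inner (register-level) tiles target the matrix/vector registers."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and C.shape == (m, n)
    for m1 in range(0, m, tile_m):
        for n1 in range(0, n, tile_n):
            for k1 in range(0, k, tile_k):
                # Cache-level views of the sub-matrices
                lhs_tile = A[m1:m1 + tile_m, k1:k1 + tile_k]
                rhs_tile = B[k1:k1 + tile_k, n1:n1 + tile_n]
                dst_tile = C[m1:m1 + tile_m, n1:n1 + tile_n]
                # Register-level tiling inside the cache tile
                for mv in range(0, lhs_tile.shape[0], tile_m_v):
                    for nv in range(0, rhs_tile.shape[1], tile_n_v):
                        for kv in range(0, lhs_tile.shape[1], tile_k_v):
                            lhs_tile_v = lhs_tile[mv:mv + tile_m_v, kv:kv + tile_k_v]
                            rhs_tile_v = rhs_tile[kv:kv + tile_k_v, nv:nv + tile_n_v]
                            dst_tile[mv:mv + tile_m_v, nv:nv + tile_n_v] += \
                                np.matmul(lhs_tile_v, rhs_tile_v)
    return C

m, n, k = 64, 64, 64
A, B = np.random.rand(m, k), np.random.rand(k, n)
C = tiled_matmul(A, B, np.zeros((m, n)))
assert np.allclose(C, A @ B)
```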

SiFive’s AI/ML software stack integrates these hardware features seamlessly, enabling end-to-end execution of high-profile models on SiFive platforms. Central to this is the Intermediate Representation Execution Environment (IREE), an open-source MLIR-based compiler and runtime optimized for SiFive microarchitectures. IREE supports diverse front-ends like PyTorch for LLMs, applying target-specific tiling policies to break down operations. It enables intra-operation parallelization, generates code via SiFive’s tuned LLVM compilers and Scalable Kernel Libraries (SKL), and mixes MLIR codegen with microkernels (ukernels) for efficiency. The runtime handles inter-operation parallelization through asynchronous execution and task scheduling, supporting both Linux and bare-metal environments.

Hsu highlighted advancements in multi-tile matrix multiplication within IREE. Previously, IREE supported only single-tile K-loops, where sources A0 and B0 are loaded once, and a single matmul accumulates into C00. Now, enhancements allow multi-tile K-loops, loading sources like A0, A1 once and distributing accumulations across multiple C tiles (e.g., C00 += A0 * B0, C10 += A1 * B0, then C01 += A0 * B1, C11 += A1 * B1). This reduces redundant loads, improving arithmetic intensity and efficiency, especially for deep neural networks where K dimensions are large.
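The load-reuse argument is easy to see in a schematic 2x2 arrangement of C tiles (a toy illustration, not IREE's generated code): each A or B tile brought in from memory now feeds two accumulations instead of one.

```python
import numpy as np

# Schematic multi-tile K-loop step: two A loads and two B loads feed
# four accumulations, doubling reuse versus the single-tile case.
def multi_tile_k_step(A0, A1, B0, B1, C00, C01, C10, C11):
    C00 += A0 @ B0
    C10 += A1 @ B0
    C01 += A0 @ B1
    C11 += A1 @ B1
    return C00, C01, C10, C11

t = 4  # tile edge length
A0, A1 = np.random.rand(t, t), np.random.rand(t, t)
B0, B1 = np.random.rand(t, t), np.random.rand(t, t)
C = [np.zeros((t, t)) for _ in range(4)]
multi_tile_k_step(A0, A1, B0, B1, *C)
# Four matmuls from four loads: arithmetic intensity doubles versus a
# single-tile K-loop, where each loaded tile feeds only one accumulation.
```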

In his takeaways, Hsu emphasized that tiled matrix multiplication is essential for high-performance AI/ML applications, as it maximizes hardware utilization. IREE excels in automating and optimizing these tiling strategies. RISC-V's VME is purpose-built for such tiled operations, delivering native performance gains. SiFive's XM series implements VME in a compact, integrated form factor, and the team's contributions to IREE—particularly multi-tile support—further amplify efficiency. This software-hardware synergy positions SiFive's stack as a robust solution for AI acceleration on RISC-V, bridging custom extensions with standardized ecosystems to drive innovation in edge and datacenter AI.

Bottom line: The presentation underscores SiFive’s commitment to advancing RISC-V for AI, combining architectural extensions with sophisticated compiler tools to tackle compute bottlenecks effectively.

Also Read:

SiFive Launches Second-Generation Intelligence Family of RISC-V Cores

Podcast EP197: A Tour of the RISC-V Movement and SiFive’s Contributions with Jack Kang

Enhancing RISC-V Vector Extensions to Accelerate Performance on ML Workloads


TSMC based 3D Chips: Socionext Achieves Two Successful Tape-Outs in Just Seven Months!
by Daniel Nenni on 12-31-2025 at 6:00 am

Synopsys Socionext 3d IC

Socionext's recent run of rapid 3D-IC tape-outs is a noteworthy milestone for the industry: two successes in just seven months for complex, multi-die designs aimed at AI and HPC workloads. That pace of iteration highlights how advanced packaging, richer EDA toolchains, and closer foundry-ecosystem collaboration are turning what used to be multi-year projects into achievable, repeatable engineering cycles.

At the heart of this acceleration are three interlocking trends: face-to-face 3D stacking that shrinks inter-die latency, process-node specialization across dies (e.g., TSMC N3 compute plus TSMC N5 I/O), and EDA/IP/cloud toolchains purpose-built for multi-die flows. Socionext’s taped-out designs reportedly combine an N3 compute die with an N5 I/O die using TSMC’s SoIC-X 3D stacking, a configuration that reduces interconnect distance and power while increasing bandwidth versus traditional 2D or 2.5D approaches.

Speeding a 3D-IC from concept to tape-out requires more than just clever floorplanning. Mechanical and thermal challenges (warpage, delamination, and heat removal), stringent reliability checks, and new timing/IR signoff flows make multi-die design complex. Socionext’s achievement illustrates how tightly integrated IP (PHYs, SerDes), 3D-aware design rules, and cloud-enabled EDA can remove bottlenecks: by automating design-rule checks for stacked interfaces, enabling distributed compute for large signoff runs, and providing pre-verified IP blocks that support high-speed interconnects. The company itself and partners emphasize that combining proven IP with AI-augmented EDA flows shortened development cycles and improved first-pass quality.

From a product perspective, 3D stacking supports an attractive value proposition for AI and HPC: put logic where it matters, optimize each die on the best process node for that function, and connect them with ultra-dense interfaces to reach system-level PPA (power, performance, area) that 2D designs cannot match. For vendors like Socionext — which target consumer SoCs as well as data-center accelerators — the ability to deliver working 3D-ICs rapidly opens new architectural options (heterogeneous dies, separable I/O fabrics, and modular chiplet ecosystems). Recent Socionext materials also show the company expanding 3DIC and 5.5D packaging support and promoting configurable chiplet building blocks to simplify system assembly.

Industry partnerships are central to this story. Socionext’s work with EDA and IP suppliers, and collaboration within the TSMC OIP ecosystem, demonstrate that 3D-IC success depends on an end-to-end supply chain: foundry stacking capabilities, packaging houses that can handle F2F and 5.5D substrates, EDA tools that understand multi-die timing and thermal behavior, and IP that is 3D-aware. The Synopsys writeup covering Socionext’s timeline explicitly credits the use of Synopsys’ 3D-enabled IP, AI-powered EDA flows, and cloud solutions as instrumental in hitting multiple tape-outs quickly.

What does this mean for the broader market? Faster, repeatable 3D tape-outs lower the barrier to entry for companies wanting to pursue heterogeneous integration. They also pressure incumbents to adopt modular approaches and to invest in multi-die verification and manufacturing readiness. However, scaling from tape-out to high-yield mass production remains the next big hurdle: yields, test strategies, and supply-chain throughput for advanced packaging will determine whether such rapid tape-out cycles translate into volume shipments and cost-effective products.

Bottom line: Socionext's two tape-outs in seven months are more than a marketing sound bite; they're a signal that the multi-die era is maturing. With the right mix of IP, EDA, foundry packaging, and ecosystem collaboration, complex 3D systems can move from experimental demos to production-grade devices on timelines that were hard to imagine just a few years ago.

Also Read:

Cerebras AI Inference Wins Demo of the Year Award at TSMC North America Technology Symposium

TSMC Kumamoto: Pioneering Japan’s Semiconductor Revival

AI-Driven DRC Productivity Optimization: Revolutionizing Semiconductor Design


RISC-V Extensions for AI: Enhancing Performance in Machine Learning
by Daniel Nenni on 12-30-2025 at 10:00 am

SiFive Risc V Summit 2025

In a presentation at the RISC-V Summit North America 2025, John Simpson, Senior Principal Architect at SiFive, delved into the evolving landscape of RISC-V extensions tailored for artificial intelligence and machine learning. RISC-V's open architecture has fueled its adoption in AI/ML markets by allowing customization and extension of core designs. However, Simpson emphasized the importance of balancing this flexibility with standardization under profiles like RVA23 to foster an open ecosystem that promotes innovation while preserving differentiation. As AI models grow exponentially (Epoch AI data shows model sizes and compute demands surging from vector workloads to massive matrix operations), the need for accelerated matrix multiplication and broader datatype support has become critical. Different application domains necessitate varied ISA approaches, but with only a handful of matrix multiply routines, software portability remains relatively unaffected by these choices.

Central to RISC-V’s AI capabilities is the Vector Extension (RVV), which addresses computations beyond matrix multiplies, such as those in activation functions like LayerNorm, Softmax, Sigmoid, and GELU. These operations, involving exponentials and normalizations, can bottleneck throughput when matrix multiplies are accelerated. For instance, prefilling Llama-3 70B with 1k tokens requires 5.12 billion exponential operations. RVV 1.0 supports integer (INT8/16/32/64) and floating-point (FP16/32/64) datatypes, with extensions like Zvfbmin for BF16 conversions and Zvfbwma for widening BF16 multiply-adds. Proposed additions, such as Zvfbta for BF16 arithmetic and Zvfofp8min for OCP FP8 (E4M3/E5M2) via conversions, aim to expand support. Discussions focus on using an altfmt bit in the vtype CSR to encode new datatypes efficiently, avoiding instruction length expansions. Future activity may include OCP MX formats like FP8/6/4, potentially requiring more instruction space or vtype bits.
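One plausible accounting for that figure, assuming Llama-3 70B's 80 transformer layers and 64 attention heads: the attention softmax alone evaluates roughly 80 × 64 × 1,000² ≈ 5.12 billion exponentials for a 1,000-token prefill, before any GELU or normalization work is counted, which is why vector-side exponential throughput matters once the matrix multiplies themselves are accelerated.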

Simpson outlined several matrix extension approaches under consideration by RISC-V task groups. The Zvbdot extension introduces vector batch dot-products without new state, leveraging existing vector registers. It computes eight dot-products per instruction, with one input from vector A and eight from group B (columns as registers), accumulating in group C. A 3-bit offset accesses up to 64 results. For VLEN=1024 with FP8 inputs and FP32 outputs, it achieves 1K MACs per instruction while writing only 256 bits, accelerating GEMM and GEMV with a vector-friendly read-heavy design.
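The read-heavy, write-light character is easier to see in a reference-model sketch (an illustration of the described semantics, not the ISA encoding): one shared slice of A and eight columns of B update a small group of accumulators.

```python
import numpy as np

# Reference-model sketch of a vector batch dot-product step (not the
# actual Zvbdot encoding): one slice of A and eight B columns update
# eight accumulators in C, so reads dominate writes.
def batch_dot_step(acc, a_slice, b_cols, offset):
    # acc: flat accumulator array (e.g., up to 64 FP32 results)
    # a_slice: shared left operand, shape (k,)
    # b_cols: eight right operands, shape (8, k)
    # offset: 3-bit offset selecting which group of 8 results to update
    acc[offset * 8:(offset + 1) * 8] += b_cols @ a_slice
    return acc

k = 32
acc = np.zeros(64, dtype=np.float32)
a = np.random.rand(k).astype(np.float32)
b = np.random.rand(8, k).astype(np.float32)
batch_dot_step(acc, a, b, offset=0)
```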

Integrated Matrix Extensions (IME TG) reuse vector registers as matrix tiles, adding minimal vtype bits. They support matrix-matrix multiplies, with higher arithmetic intensity from longer vectors. Most sub-proposals require new tile load/store instructions, and Option-G is advancing. Write demands for result C might necessitate register renaming in the matrix unit, transparent to software.

Vector-Matrix Extensions (VME TG) add large matrix accumulator state for C, divided into tiles, while using RVV vectors for A and B. Outer-product multiplies accumulate into C, with potential “fat” support for narrower inputs. It includes moves between C and vectors/memory, enabling high throughput by placing accumulators near arithmetic units.

Attached Matrix Extensions (AME TG) introduce separate state for A, B, and C, performing matrix-matrix multiplies independently of RVV. If RVV is absent, new vector operations on matrix state are needed; otherwise, integration is preferred. Requiring dedicated load/store paths, AME offers the largest design space for peak performance, though no consensus proposal exists yet.

Performance varies by approach: Zvbdot suits LLM decode phases with batch=1, accelerating GEMV. IME fits edge devices prioritizing area/power. VME balances vector sourcing with high MACs, while AME maximizes MACs but demands more resources. For LLMs, larger batches improve efficiency but strain KV cache bandwidth.

Bottom line: These extensions position RISC-V as a versatile AI platform, evolving to meet diverse needs from edge to hyperscale. SiFive’s insights highlight ongoing standardization efforts to ensure scalability and ecosystem growth.

Also Read:

SiFive Launches Second-Generation Intelligence Family of RISC-V Cores

Podcast EP197: A Tour of the RISC-V Movement and SiFive’s Contributions with Jack Kang

Enhancing RISC-V Vector Extensions to Accelerate Performance on ML Workloads


Runtime Elaboration of UVM Verification Code
by Tom Anderson on 12-30-2025 at 6:00 am

AMIQ UVM Runtime Elaboration in DVT IDE

Recently, I reported on my conversation with Cristian Amitroaie, CEO of AMIQ EDA, about automated generation of documentation from design and verification code. Before we chose that topic for a post, Cristian described several capabilities of the AMIQ EDA product family that might be of interest to design and verification engineers. For today’s post, I’ve selected runtime elaboration of Universal Verification Methodology (UVM) code because I wanted to know more about the benefits for engineers working on real-world chip projects.

What do you mean by elaboration?

When our tools read in design and verification code, we check for a wide variety of errors, and then we build a complex internal model that reflects every aspect of the code. For example, in our Design and Verification Tools (DVT) Integrated Development Environment (IDE) family, we perform a full design elaboration. That means we build a model with the complete design hierarchy, with all parameters computed, generate blocks resolved, binds applied, and so on. This allows design engineers to explore design hierarchies, trace signals and parameters, draw schematic diagrams, and perform many other useful tasks.

How do you handle verification code?

We also build a complete model for verification environments, which are usually based on UVM. Verification engineers often partially mirror the design hierarchy with a tree of components such as drivers and monitors organized into UVM testbench components. They also define and instantiate verification-specific components such as scoreboards and sequencers. All the components are connected together using transaction-level modeling (TLM) ports, defining a verification topology.

Is the verification topology like the design hierarchy?

In some ways yes, but verification topologies are not defined in a static manner like design hierarchies. There is no top module instantiating submodules, and so on, that can be statically computed. The verification topology is controlled per UVM test by activating or deactivating drivers, replacing some components with others tuned to match specific test requirements, connecting specific components to specific design interfaces, etc. The UVM verification component hierarchy is constructed by executing a specific UVM flow at simulation time 0. During this execution, all configuration via the “config db” setters/getters mechanism is performed, all the factory overrides are applied, and more.

What does this mean for DVT IDE?

The bottom line is that verification elaboration cannot be completed until UVM phase 0 (activity at time 0) is executed. We could have called a third-party simulator for this execution, but that takes time and adds overhead. Instead, DVT IDE actually performs a “run 0” internally to allow all the UVM elaboration to happen. We call this process UVM runtime elaboration to reflect its non-static nature.

How does this work in DVT IDE?

Users can ask for the runtime elaboration of a specific UVM test and use breakpoints to debug the “run 0” execution. When a breakpoint interrupts the execution, users can browse the call stacks on each parallel thread and inspect variables. We provide different types of breakpoints, which can be conditional. Users can browse the function call stack and all the breakpoints they’ve set in their project. They can also step through the executed code and inspect variable values, add log points to print information without altering the verification code, and add watchpoints to interrupt upon variable changes.

During UVM runtime elaboration, DVT IDE collects information about factory override definitions and if/where they are applied; information about the config database, including set/get calls and how they are paired; information about the register model, including address and bitfield computation; information about which physical interfaces are connected to virtual interfaces; and information about TLM port connections.

How does this help engineers create, explore, and debug the verification topology?

All this information collected is available in DVT IDE to help engineers explore their verification topology, the tree of components, the register model, the config db, and more. DVT IDE can also display a diagram of all the nested components, including their connections via TLM ports and their connections to the design via virtual interfaces. This is called the UVM Components Diagram.

We can determine some of this verification topology statically, but runtime elaboration allows us to compute actual data that perfectly matches what would happen in a simulator at time 0. Users get all the benefits I’ve mentioned without having to access a simulator. This saves time since the internal UVM runtime elaboration is faster than invoking an external tool that builds a model for full simulation.

What other capabilities benefit the users?

Three things spring to mind. First of all, many verification environments use C models in addition to UVM SystemVerilog code. We support DPI-C calls during “run 0” so this is not an issue. Second, if the verification code changes, users don’t have to go through the compilation and design elaboration process all over again. DVT IDE incrementally analyzes the changes and performs the UVM runtime elaboration. Finally, after the elaboration is done, we save a database that users can load anytime. This means that if there are no changes to the UVM topology, verification engineers can simply load the snapshot without having to execute runtime elaboration again.

Any final thoughts?

The capabilities I’ve listed are robust and well proven by many users over several years. In this post, I’ve only given an overview. To find out more, I recommend a concise tutorial available on our website. Of course, interested verification engineers can contact us to schedule a demo or request an evaluation license.

Thank you for your time, Cristian.

Likewise, and Happy Holidays!

Also Read:

Better Automatic Generation of Documentation from RTL Code

AMIQ EDA at the 2025 Design Automation Conference #62DAC

2025 Outlook with Cristian Amitroaie, Founder and CEO of AMIQ EDA


CISCO ASIC Success with Synopsys SLM IPs
by Daniel Nenni on 12-29-2025 at 10:00 am

Cisco Silicon One networking

Cisco’s relentless push toward higher-performance networking silicon has placed extraordinary demands on its ASIC design methodology. As transistor densities continue to rise across advanced SoCs, traditional design-time guardbands are no longer sufficient to ensure long-term reliability, consistent performance, and efficient power consumption. Instead, these chips require deep, real-time observability throughout the operational lifecycle. The challenge is addressed through Cisco’s adoption of Synopsys Silicon Lifecycle Management (SLM) IPs. The company’s latest Silicon One ASICs integrate a broad set of embedded monitors and analytics capabilities that collectively redefine what in-silicon visibility looks like.

Modern networking ASICs operate under highly dynamic conditions. Voltage and temperature fluctuate constantly inside dense logic blocks, and variations in process corners across a single die can influence timing behavior in subtle but meaningful ways. Cisco faces additional pressure because its chips target mission-critical infrastructure where uptime, predictability, and performance efficiency are paramount. According to the success story, transistor aging, exacerbated by thermal and voltage cycling, can reduce timing slack over time, making continuous monitoring essential to safeguard performance margins.

To address these challenges, Cisco deployed a comprehensive suite of Synopsys SLM IPs across its newest ASIC platforms. At the center of this strategy is the Process, Voltage, and Temperature Monitor (PVT) subsystem, orchestrated by the PVT Controller (PVTC). The PVTC aggregates data from multiple distributed sensors, enabling a unified view of environmental and process states across the chip. With this real-time data, the system can support dynamic voltage and frequency scaling, optimizing power and performance based on immediate conditions rather than worst-case assumptions.

Several sensor types feed into this controller. The Process Detector identifies variations across silicon regions, helping Cisco tune performance and understand die-to-die differences. Voltage Monitors track fluctuations in supply rails, ensuring critical blocks operate within safe thresholds. Distributed Temperature Sensors and thermal diodes provide granular thermal maps, improving both thermal management and temperature-dependent calibration. Collectively, these sensors give unprecedented visibility into what is happening inside every major functional quadrant of the ASIC.

Beyond PVT data, Cisco uses the Path Margin Monitor to watch critical timing paths in real time. Instead of relying solely on static timing analysis or margin-heavy design, PMM enables early detection of timing degradation due to aging or unexpected workload conditions. Meanwhile, the Clock Delay Monitor focuses on SRAM behavior, measuring access times and ensuring that memory blocks meet their intended timing specifications during actual operation.

The results are substantial. Cisco has achieved significantly enhanced real-time observability across its ASIC designs, enabling dynamic optimization of power and performance rather than fixed guard-banding. The continuous monitoring of path margins and aging allows proactive reliability management, helping extend the usable lifespan of the silicon. The insights generated not only improve today’s chips but also feed back into future design cycles, refining models and guiding architectural decisions. The modular nature of Synopsys SLM IPs also ensures Cisco can tailor sensor density and placement to each ASIC’s unique requirements, balancing efficiency with coverage.

Bottom line: Cisco plans to leverage Synopsys Silicon.da analytics to mine the vast data produced under diverse operating conditions. This data-driven feedback loop positions Cisco to continue advancing high-performance networking silicon while reducing risk and improving consistency across its product lines. Through its collaboration with Synopsys, Cisco has established a new benchmark for ASIC observability, reliability, and lifecycle optimization in the networking domain.

https://www.synopsys.com/success-stories/cisco-enhances-asic-slm.html

Also Read:

How PCIe Multistream Architecture Enables AI Connectivity at 64 GT/s and 128 GT/s

WEBINAR: How PCIe Multistream Architecture is Enabling AI Connectivity

Lessons from the DeepChip Wars: What a Decade-old Debate Teaches Us About Tech Evolution


RISC-V: Powering the Era of Intelligent General Computing
by Daniel Nenni on 12-29-2025 at 8:00 am

Andes RISC V Summit 2025 Charlie Su

Charlie Su, President and CTO of Andes Technology, delivered a compelling keynote at the 2025 RISC-V Summit North America, asserting that RISC-V is primed to drive the burgeoning field of Intelligent General Computing. This emerging paradigm integrates AI and machine learning into everyday computing devices, from AI-enabled PCs and smartphones to edge servers, software-defined vehicles, and robotic platforms. Su emphasized that advancements in AI/ML are infusing intelligence into general-purpose computing, enabling applications in personal use, factory automation, surveillance, drones, and autonomous driving (ADAS Levels 0-4). He predicted that robots, as app-enabled platforms, could surpass the smartphone market in scale. To support this, Intelligent General Computing demands a robust ecosystem for both general-purpose tasks and large-scale AI/ML, encompassing software and hardware.

Charlie highlighted RISC-V's role in fostering innovations for large-scale AI/ML. A prime example is Meta's Training and Inference Accelerator (MTIA), which leverages Andes' vector and scalar cores alongside the Automated Custom Extension (ACE) framework, as detailed in ISCA 2023. Two generations of MTIA have been deployed in Meta's data centers since 2023, based on RISC-V processors with automated extensions. Other accelerators using SRAM-based Compute-In-Memory include solutions for servers (e.g., Rivos AI SoC), cloud services (SAPEON), photonics-based AI, and ADAS systems. These are powered by Andes cores like AX46MPV, AX45MPV, NX27V, and AX65, demonstrating RISC-V's versatility in high-performance AI.

The RISC-V software ecosystem is maturing rapidly, bolstered by initiatives like RISE (RISC-V Software Ecosystem), which accelerates open-source software development, improves quality, and aligns efforts for cloud and IoT devices. Java 22/21 support is already in place, with tools spanning compilers (LLVM, GCC, GLIBC), system libraries (FFmpeg, OpenBLAS), kernel/virtualization (Linux, Android, Performance Profiles), and more. Premier members include Andes, Google, Intel, NVIDIA, Qualcomm, and Samsung. Debian’s open-source support underscores this maturity, with RISC-V achieving a 98.4% successful build rate across over 64,000 packages—ranking third overall. Metanoia’s 5G O-RAN software architecture further exemplifies modular, full open-source releases for semi-turnkey solutions.

Andes' processor lineup is tailored for this era. The AX46MPV offers powerful compute and efficient control, compliant with RVA22+ including AIA and SV39/48/57 virtualization. It features dual-issue for vector/scalar instructions, a Vector Processing Unit (VPU) with VLEN/DLEN from 128-1024 bits, supporting int4-int64 and bf16/fp16-64 formats, plus enhanced ReductionSum. Multicore support reaches 16 cores, with boosted memory performance via dual-issue load/store, deep outstanding-request support, and a High-speed Vector Memory (HVM) interface handling multiple OOO requests. Performance gains over AX45MPV include ~18% in SpecInt2006 (5.65 score), over 2x in key vector libraries (libvec, libnn), and +40% bandwidth.

The AX66, a mid-range application processor, is RVA23 compliant with dual vector pipes (VLEN=128), 4-wide frontend decode, 128-entry ROB, 8 execution pipelines, and TAGE-L branch predictor. It supports up to 8 cores, 32MB shared L3 cache (mostly exclusive), and 128/256-bit AXI4 interfaces with IOMMU, APLIC, and CHI. Vector performance yields >10x in libnn key functions (9.6x average), >4x in libvec (3.55x average), and significant crypto boosts (4.7x SHA-256, 10.5x AES-128, 6.4x SM4). Bandwidth increases by 25%.

For high-end needs, the Cuzco series scales to 20 SpecInt2k6/GHz, with patented time-based scheduling via Time Resource Matrix for efficient instruction issuing and power reduction. RVA23 compliant, it features 8-wide decode, 256 ROB entries, 8 pipelines (2 per slice), advanced branch prediction, private L1/L2 caches, up to 256MB shared L3, multiprocessor up to 8 cores, and CHI/256-bit MMIO. Early 5nm implementation targets 2.5GHz, with current SpecInt2006 at ~18/GHz, using 7M gates for CPU and 4.5M for 2MB L2.

Andes enhances the ecosystem with AndesAIRE, an "AI Runs Everywhere" end-to-end solution, including IDEs, NN SDKs, compilers (MLIR, TVM), interpreters (ONNX Runtime, PyTorch), and accelerators like AndLA 1350. OS support is comprehensive: RISC-V specs (RVA22/23 profiles, SoC platforms), Linux distros (Debian, Fedora, Ubuntu, verified by Andes), upstream kernel features (strace/ftrace, Perf, HIGHMEM, CPU hotplug, ongoing Suspend-to-RAM and PowerBrake), bootloaders (U-Boot, OpenSBI), and RTOS (FreeRTOS, Zephyr, ThreadX).

Bottom line: Charlie noted Andes leads RISC-V IP shipments with rich portfolios. The latest processors—AX46MPV for compute/control, AX66 to Cuzco for performance—position Andes strongly. The RISC-V ecosystem is ready for Intelligent General Computing, promising transformative impacts across industries.

Contact Andes

Also Read:

Journey Back to 1981: David Patterson Recounts the Birth of RISC and Its Legacy in RISC-V

Google’s Road Trip to RISC-V at Warehouse Scale: Insights from Google’s Martin Dixon

Bridging Embedded and Cloud Worlds: AWS Solutions for RISC-V Development


Simulating Quantum Computers. Innovation in Verification
by Bernard Murphy on 12-29-2025 at 6:00 am


Quantum algorithms must be simulated on classical computers to validate correct behavior, but this looks very different from classical logic simulation. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and lecturer at Stanford, EE292A) and I continue our series on research ideas. As always, feedback welcome.

The Innovation

This month's pick is How to Write a Simulator for Quantum Circuits from Scratch: A Tutorial. The authors are from École de Technologie Supérieure, Montreal and the University of Massachusetts. The paper was posted on arXiv in June 2025.

Quantum simulators work on an abstraction – how qubits and “gates” are implemented is a fascinating topic but a distraction for this discussion. Our goal in this review is to introduce the topic of simulating quantum algorithms on a classical computer, because these methods are sufficiently disjoint from familiar classical computation to require an introduction before we move onto new research in this area.

This paper introduces a method to build a simulator for a small quantum computer (~20 qubits). It is supported by web-based implementations and code walkthroughs that give a sense of how quantum simulation works. Think of it as linear algebra: evaluating a circuit means multiplying an initial qubit state vector by a series of tensors corresponding to the gates in the circuit.

Paul’s view

Quantum venture funding is already well over $3B, rising fast and getting a lot of attention in the media. So what about verifying quantum circuits? First stop here is a quantum circuit simulator. Kudos to Bernard for finding a wonderfully written paper on this topic. It describes notations used for describing quantum circuits, both graphically and in equation form. It also works through the basic math needed to understand how a quantum simulator works. It’s an algorithmic level paper, not a paper on quantum physics.

In a digital circuit, each "bit" of state (a register or a wire) can be read and written independently. Logic simulators need to process transitions on registers and wires in time order, via an event queue, but this processing is local and only needs to consider the gates to which each signal is directly connected.

In the quantum world “qu-bits” of state are “entangled” and need to be considered collectively as a single “state vector”. Simulating a quantum circuit proceeds like an analog circuit simulation where a vector of all the voltages or currents on each wire is formed, and simulation involves multiplying this vector with a matrix whose coefficients are determined by the circuit components and connectivity. For a circuit with n wires an analog simulator must multiply a 1 x n circuit state vector by an n x n simulation matrix derived from the circuit structure.

The cool thing about a quantum circuit is that a circuit with n qubits has a state vector with 2^n elements, one for each of the 2^n binary representations of n bits. A quantum circuit performs operations simultaneously on all 2^n elements of this state vector, which means it conceptually operates in parallel on all 2^n possible values of the n qubits.

To simulate a quantum circuit with non-quantum digital hardware means multiplying a quantum state vector of size 2^n by a simulation matrix of size 2^n x 2^n, which is O(4^n) multiplications. The paper works through some neat algorithmic tricks, based on fundamental properties of quantum state vectors and simulation matrices, that improve the runtime complexity to O(n·2^n). The elements of the state vector are complex floating-point numbers, so the entire simulation maps very well to GPUs, e.g. this NVIDIA blog claims evaluating up to 36 qubits using eight A100s. Wow!
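To make that complexity reduction concrete, here is a minimal sketch in the spirit of the paper (not code taken from it): a single-qubit gate is applied to an n-qubit state vector by pairing amplitudes in place, so no 2^n x 2^n layer matrix is ever formed.

```python
import numpy as np

def apply_single_qubit_gate(state, gate, target, n):
    """Apply a 2x2 gate to qubit `target` of an n-qubit state vector
    (length 2**n) by pairing amplitudes in place -- the qubit-wise trick
    that avoids building the full 2**n x 2**n matrix."""
    stride = 1 << target
    for base in range(0, 1 << n, stride << 1):
        for i in range(base, base + stride):
            a0, a1 = state[i], state[i + stride]
            state[i] = gate[0, 0] * a0 + gate[0, 1] * a1
            state[i + stride] = gate[1, 0] * a0 + gate[1, 1] * a1
    return state

n = 3
state = np.zeros(1 << n, dtype=complex)
state[0] = 1.0                          # start in |000>
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
for q in range(n):                      # Hadamard on every qubit
    apply_single_qubit_gate(state, H, q, n)
# Uniform superposition: every probability is 1/8
assert np.allclose(np.abs(state) ** 2, 1 / 8)
```

Each gate touches all 2^n amplitudes once, so a layer of up to n gates costs on the order of n·2^n operations, matching the complexity quoted above.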

Each element in the state vector is a complex number, whose magnitude squared is the probability of the circuit being in that state. The sum of the squared magnitudes across the whole state vector is 1, and you can think of the state vector as representing a point on the surface of a 2^n dimensional hypersphere whose radius is 1. The goal of a typical quantum circuit algorithm is to use quantum gates to move the state vector around this hypersphere until it points almost perfectly along the axis of the dimension that is the desired result of the algorithm. Logic gates in digital circuits perform Boolean operations on state bits to calculate their result. Quantum gates rotate state vectors in various ways around their hypersphere. Developing a quantum algorithm requires figuring out a combination of rotational operations that move the state vector towards the desired result. Let's see what Bernard can find published on what it means to verify these kinds of algorithms.

Raúl’s view

This month’s paper is a very nice, detailed tutorial on how to build a quantum circuit simulator using classical computing techniques, even with minimal prior knowledge of quantum mechanics. A simulator is verification 101; the purpose of creating a simulator from scratch is not as an alternative to existing open-source and commercial packages, but for a deeper understanding of quantum computing and the core algorithms necessary. It introduces essential quantum concepts and notations such as Dirac notation, state vectors, Hilbert space, tensor products, and the Bloch sphere, and quantum gates such as Hadamard, SWAP, Toffoli (CCNOT), Pauli X, Y and Z. Unlike physical quantum computers which collapse the state to 0 or 1 when measured, simulators can directly compute the complete state, including the probability of a 1 and the phase (Bloch sphere coordinates of each qubit). Measurement gates collapse the state and result in two new state vectors, corresponding to a measurement of 0 and 1.

The resulting simulator can handle up to ~20 qubits on a personal computer in roughly 1,000–2,000 lines of JavaScript (the largest quantum computer that can be simulated on an HPC is about 50 qubits). An emphasis is placed on efficiency to handle the computational complexity of explicit matrix multiplication: qubit-wise multiplication avoids explicitly forming the large layer matrices but is still O(2^n·n·d) for d layers with n gates each, and SWAP, the exchange of the states of two qubits, is simulated by directly manipulating the indices of the state vector's amplitudes, also with exponential complexity. Further enhancements mentioned include adding robust error checking, implementing memory-saving in-place updates, and leveraging hardware acceleration via GPU programming.
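Raúl's point about SWAP is also easy to sketch (again an illustration, not the paper's code): exchanging two qubits only permutes amplitude indices, so no gate matrix is needed at all.

```python
import numpy as np

# Simulate a SWAP of qubits p and q by permuting amplitude indices
# directly, rather than applying a 2**n x 2**n matrix.
def swap_qubits(state, p, q, n):
    new_state = state.copy()
    for i in range(1 << n):
        bit_p = (i >> p) & 1
        bit_q = (i >> q) & 1
        if bit_p != bit_q:
            j = i ^ ((1 << p) | (1 << q))  # index with bits p and q exchanged
            new_state[j] = state[i]
    return new_state

n = 3
state = np.zeros(1 << n, dtype=complex)
state[0b001] = 1.0                      # qubit 0 is |1>, others |0>
state = swap_qubits(state, 0, 2, n)
assert state[0b100] == 1.0              # the |1> has moved to qubit 2
```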

I found the paper a great introduction to quantum computing. The online simulators help explain the basics, and the paper references commercial systems and more advanced research for readers interested in more detail.

Also Read:

Quantum Advantage is About the Algorithm, not the Computer

Quantum Computing Technologies and Challenges

Quantum Computing Algorithms and Applications