
Why IP Quality and Governance Are Essential in Modern Chip Design

Why IP Quality and Governance Are Essential in Modern Chip Design
by Admin on 10-30-2025 at 6:00 am


By Kamal Khan

In today’s semiconductor industry, success hinges not only on innovation but also on discipline in managing complexity. Every system-on-chip (SoC) is built from hundreds of reusable IP blocks—standard cells, memories, interfaces, and analog components. These IPs are the foundation of the design. But if the foundation is weak, even the most ambitious architecture can fail.

This is where IP Quality and Governance in IPLM (Intellectual Property Lifecycle Management) become critical. They are not “nice to have” features. They are guardrails that protect design teams from costly errors, late rework, and unpredictable tape-outs.

The Value of IP Governance

Governance ensures that every IP block follows a clear lifecycle—moving from Development to Certification, then to Published, and eventually to Obsolete. At each step, policies define what must be true before an IP can move forward.

This isn’t about bureaucracy—it’s about consistency and trust. In a world where design teams are distributed across continents, governance makes sure that everyone is working from the same playbook.
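To make the lifecycle concrete, here is a minimal sketch of governance as a gated state machine. This is an illustration only, not Perforce IPLM code; the state names follow the article, while the policy checks and field names are hypothetical.

```python
# Hypothetical lifecycle gate: an IP version advances to the next state only
# when every policy attached to that transition passes.
LIFECYCLE = ["Development", "Certification", "Published", "Obsolete"]

def can_advance(ip_version, policies):
    """policies: callables that each return True/False for this IP version."""
    return all(policy(ip_version) for policy in policies)

def advance(ip_version, policies):
    current = ip_version["state"]
    nxt = LIFECYCLE[LIFECYCLE.index(current) + 1]
    if not can_advance(ip_version, policies):
        raise ValueError(f"{ip_version['name']} stays in {current}: a policy failed")
    ip_version["state"] = nxt
    return ip_version

# Example: a block moves from Development to Certification only if it has an owner.
block = {"name": "usb3_phy", "state": "Development", "owner": "analog-team"}
advance(block, [lambda ip: "owner" in ip])
```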

Why Quality Rules Matter

IP Quality checks are the quality gates that enforce readiness. Instead of relying on subjective judgment, they apply automated rules and checklists:

  • Does this IP version pass all regression tests?
  • Is the CAD environment aligned with the correct process node?
  • Are mandatory properties and metadata captured?
  • Has it cleared security or licensing requirements?

The answer is binary: Pass or Fail. Only when an IP meets the defined criteria does it advance.
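As a rough illustration of such a gate (the rule names below are hypothetical, not Perforce's actual checks), the decision reduces to a conjunction of automated checks that either all pass or the IP stays put:

```python
# Hypothetical quality gate: every rule must pass before the IP version advances.
def quality_gate(ip_metadata, results):
    checks = {
        "regressions_pass":  results.get("regression_failures", 1) == 0,
        "process_node_ok":   ip_metadata.get("process_node") == results.get("cad_node"),
        "metadata_complete": all(k in ip_metadata for k in ("owner", "version", "license")),
        "security_cleared":  results.get("security_review") == "approved",
    }
    failed = [name for name, ok in checks.items() if not ok]
    return ("Pass", []) if not failed else ("Fail", failed)

verdict, reasons = quality_gate(
    {"owner": "memory-team", "version": "2.1.0", "license": "ok", "process_node": "N5"},
    {"regression_failures": 0, "cad_node": "N5", "security_review": "approved"},
)
print(verdict, reasons)  # Pass []
```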

This objectivity reduces risk, improves confidence, and accelerates integration.

Real-World Examples

1. Safeguarding Tape-Out

A global SoC team nearly missed a major smartphone launch because a third-party IP was integrated at the wrong maturity level. The bug was caught late and fixing it cost both schedule and reputation. With IPLM governance, only Final-certified IPs would have been allowed into the design at tape-out, preventing the slip.

2. Ensuring Process Compatibility

An analog IP provider released an updated block for a 5nm process. Governance rules automatically checked the metal stack property. The rule flagged a mismatch before integration, saving weeks of debug time.

3. Enabling Global Collaboration

A memory IP updated in the U.S. was inadvertently used by an Asia-based team before it was production-ready. With IPLM, policies enforce semantic versioning and access control, ensuring that immature IPs stay restricted until they’re validated.

The Bigger Picture

The cost of a re-spin can run into the tens of millions. The cost of late discovery can be even higher: lost market windows.

By embedding IP Quality and Governance into the design process, organizations gain:

  • Predictability: Designs progress with fewer surprises.
  • Traceability: Every decision, rule, and approval is logged.
  • Scalability: Teams across geographies work with the same trusted data.

Closing Thought

In an era where semiconductor complexity grows faster than design cycles, discipline is the new differentiator. IP Quality and Governance in IPLM aren’t just technical features—they are strategic enablers of faster, safer, and smarter innovation.

Watch the Free Demo

Talk to an Expert

Kamal Khan is Perforce Vice President, North America Automotive/Semiconductor. He has over 20 years of domestic and international experience, specializing in PLM, Data Management, IP lifecycle management, IoT Security, Semiconductors, Enterprise software, EDA, CAD, 3D Printing, and Cloud solutions.

This article was originally published on Perforce.com. For more information on how Perforce IPLM streamlines semiconductor development, visit https://www.perforce.com/products/helix-iplm

Also Read:

IPLM Today and Tomorrow from Perforce

Perforce and Siemens at #62DAC

Perforce Webinar: Can You Trust GenAI for Your Next Chip Design?


U.S. Electronics Production Growing

U.S. Electronics Production Growing
by Bill Jewell on 10-29-2025 at 2:00 pm


U.S. electronics production has been on an accelerating growth trend over the last ten months. Three-month average change versus a year ago (3/12 change) has increased from 0.4% in October 2024 to 6.2% in August 2025. Japan’s 3/12 change has been positive since November 2024, but has been decelerating for most of 2025, reaching 1.1% in August. The 27 countries of the European Union (EU 27) have mostly experienced negative 3/12 change, with July 2025 at -1.8%. UK 3/12 change turned positive in July and reached 1.8% in August.
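For readers unfamiliar with the metric, here is a minimal sketch (my own illustration, not Semiconductor Intelligence's methodology code) of how a 3/12 change is computed from a monthly production index:

```python
# 3/12 change: latest three-month average versus the same three-month average
# one year earlier, expressed as a percent change.
def three_twelve_change(monthly_index, month):
    """monthly_index: production-index values, oldest first.
    month: position of the latest month (needs at least 15 months of history)."""
    recent   = sum(monthly_index[month - 2 : month + 1]) / 3.0
    year_ago = sum(monthly_index[month - 14 : month - 11]) / 3.0
    return 100.0 * (recent / year_ago - 1.0)

# Example with made-up index values: about 6% growth versus a year earlier.
series = [100] * 12 + [105, 106, 107]
print(round(three_twelve_change(series, len(series) - 1), 1))  # 6.0
```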

The accelerating growth in U.S. electronics production is partially due to U.S. and foreign companies shifting manufacturing to the U.S. from other countries. The shift is primarily due to the Trump administration’s tariffs – both implemented and threatened. However, the manufacturing shift has not led to increased employment in electronics manufacturing. According to the U.S. Bureau of Labor Statistics, U.S. employment in electronics manufacturing has declined from 1.04 million jobs in January 2024 to 1.001 million jobs in August 2025. August is the latest available month due to the government shutdown. In the last ten years, employment has been in a narrow range of 1.001 million to 1.062 million jobs.

Electronics production in key Asian countries has been volatile but generally growing over the last two years. The shift of electronics manufacturing to the U.S. has not had any noticeable impact on China. China electronics production has shown steady growth with 3/12 change in the range of 10% to 13% since January 2024. China’s 3/12 change in September 2025 was 10.7%. South Korea’s 3/12 change has moderated to 7.4% in July 2025 from 17% in May. Malaysia and Vietnam had similar trends in the last six months, with 3/12 change moderating from the 8% to 10% range in March through May 2025 to the 5% to 6% range in June through August. India has been volatile, with 3/12 change peaking at 15% in April 2025 and falling to zero in July. India’s August 3/12 change was 1.9%.

As noted in our June 2025 newsletter, U.S. imports of smartphones have dropped significantly beginning in April 2025. The Trump administration stated in April 2025 that smartphones were currently exempt from tariffs but would be subject to tariffs in “a month or two.” Six months later, no smartphone tariffs have been announced. Imports remained low based on the latest data available through July. Due to the U.S. government shutdown, data for August is not available. However, China’s export data is available through September 2025. Apparently, China has a functioning government. In September, China’s exports of smartphones to the U.S. increased sharply to 2.26 million units from 1.02 million units in August, an increase of 121%. The average unit price (AUP) of these smartphones almost doubled from $702 in August to $1,387 in September. These increases coincide with Apple’s introduction of its iPhone 17 models in September 2025. Apple and other smartphone companies have apparently been limiting imports to the U.S. due to the tariff threat, but increased imports in September to support the release of new models.

The Trump administration’s tariff policy may have contributed to increased U.S. electronics production, but it has not led to new jobs in the industry. The current tariffs and potential tariffs continue to cause uncertainty in the U.S. and global electronics industry.

Bill Jewell
Semiconductor Intelligence, LLC
billjewell@sc-iq.com

Also Read:

Semiconductor Equipment Spending Healthy

Semiconductors Still Strong in 2025

U.S. Imports Shifting


Failure Prevention with Real-Time Health Monitoring: A proteanTecs Innovation

Failure Prevention with Real-Time Health Monitoring: A proteanTecs Innovation
by Daniel Nenni on 10-29-2025 at 10:00 am


In the complex world of semiconductors, reliability, availability, and serviceability (RAS) have become paramount, especially as devices shrink to nanoscale geometries like 2nm. At the recent 2025 TSMC OIP Forum, Noam Brousard, VP of Solutions Engineering at proteanTecs, presented “Failure Prevention with Real-Time Health Monitoring (RTHM™),” highlighting how modern electronics face unprecedented challenges. From smaller architectures and high-performance workloads to hyper-competition and cost pressures, these factors contribute to functional failures, silent data corruption, and system-wide errors. As hardware must endure longer lifecycles, often 4-6 years without refresh, the risk of failures escalates, particularly in large-scale AI systems where devices operate at lower voltages and under unpredictable demands.

Silent data corruption (SDC) emerges as an insidious threat. Unlike detectable errors, SDC stems from untraceable hardware failures that evade exception mechanisms and system logs. It propagates undetected, causing cascading issues that demand extensive root-cause analysis. In AI-driven environments, SDC can yield incorrect outputs, faulty decisions, and parameter corruption in models, with catastrophic implications for critical applications. Brousard cited real-world examples underscoring SDC’s rise. Meta reported miscalculated mathematical operations in defective CPUs leading to database losses, where a file decompression error produced zero instead of 156. Alibaba Cloud encountered checksum mismatches in storage apps due to intermittent processor faults. Google noted manufacturing defects exposed by rare instructions in low-level libraries, while other cases involved incorrect hashing and cache coherence issues. Studies from Google, Meta, Facebook, and Alibaba reveal that approximately one in a thousand machines in large fleets suffers from SDC, emphasizing its prevalence in production CPU populations.

Traditional approaches fall short. Built-in self-test (BIST) integrations are complex and expensive, running only at startup with slow responses and no precise location pinpointing. Hardware and software checks often react post-failure, lacking the granularity needed for proactive intervention.

proteanTecs’ RTHM is part of their comprehensive lifecycle solutions spanning power/performance optimization, reliability monitoring, functional safety, chip and system production, and advanced packaging. RTHM shifts the paradigm from error containment to failure avoidance by providing electronics visibility from within. It employs on-chip Agents for high-coverage, continuous monitoring of actual performance-limiting paths, both at test and in mission mode. These Agents sample high-speed clocks in real paths, adhering to power-performance-area (PPA) constraints, and are sensitive to workload stress, latent defects, operating conditions, DC IR drops, local Vdroops, hot spots, and aging.

A key feature is the Performance Index (PI), an event-based algorithm that aggregates timing margin measurements across thresholds, affected areas, clock/power domains, and prior events. Analyzed per logical unit, PI delivers an integrated score reflecting issue severity—how close a device is to failure. Visualized as a percentage (e.g., 79%), it enables operators to act before problems escalate.
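proteanTecs has not published the algorithm, so the following is only a conceptual sketch (field names and weighting are invented) of how timing-margin events might be aggregated into a single percentage per logical unit:

```python
# Hypothetical aggregation of Agent timing-margin events into a health score:
# 100% means full design margin remains, values near 0% mean imminent failure.
def performance_index(margin_events, design_margin_ps):
    """margin_events: dicts with measured slack (ps) and a weight capturing
    affected area, clock/power domain, and prior-event history."""
    if not margin_events:
        return 100.0
    weighted = sum(e["weight"] * max(e["slack_ps"], 0.0) for e in margin_events)
    total_w  = sum(e["weight"] for e in margin_events)
    score = 100.0 * min(weighted / (total_w * design_margin_ps), 1.0)
    return round(score, 1)

events = [{"slack_ps": 40, "weight": 2.0}, {"slack_ps": 55, "weight": 1.0}]
print(performance_index(events, design_margin_ps=57))  # 78.9, close to the 79% example above
```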

Without RTHM, failures manifest after escalation, complicating root causes and incurring costly downtime. With it, potential issues are identified and mitigated preemptively, yielding faster, accurate, cost-effective predictions. This proactive stance avoids functional failures, prevents SDC, and eliminates system-wide errors. RTHM offers accurate fault detection at the circuit level, reliability monitoring for intrinsic/extrinsic faults, and unmatched resiliency to halt error propagation.

Bottom line: As semiconductors push boundaries, RTHM represents a transformative tool. By embedding intelligence directly into chips, it empowers engineers to predict and avert failures, safeguarding operations in an era of scale and complexity.

For more, contact proteanTecs.

Also Read:

Podcast EP313: How proteanTecs Optimizes Production Test

Thermal Sensing Headache Finally Over for 2nm and Beyond

DAC News – proteanTecs Unlocks AI Hardware Growth with Runtime Monitoring


Podcast EP314: An Overview of Toshiba’s Strength in Power Electronics with Jake Canon

Podcast EP314: An Overview of Toshiba’s Strength in Power Electronics with Jake Canon
by Daniel Nenni on 10-29-2025 at 8:00 am

Daniel is joined by Jake Canon, senior business development engineer at Toshiba America Electronic Components. Jake is an enthusiastic contributor to the semiconductor industry and has been working closely with engineers to find new discrete power solutions for a wide variety of cutting-edge applications.

Dan explores the 150-year history of Toshiba with Jake, who focuses on Toshiba’s advanced work in power electronics. Jake describes the work Toshiba has done and is doing with low voltage power MOSFETs. He explains that Toshiba is on its 11th generation of these devices. He provides excellent detail on the innovations that have been achieved in both device and packaging and the broad range of applications supported such as automotive and AI. Jake also describes the worldwide manufacturing footprint of Toshiba.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


Inference Acceleration from the Ground Up

Inference Acceleration from the Ground Up
by Lauro Rizzatti on 10-29-2025 at 6:00 am


VSORA, a pioneering high-tech company, has engineered a novel architecture designed specifically to meet the stringent demands of AI inference—both in datacenters and at the edge. With near-theoretical performance in latency, throughput, and energy efficiency, VSORA’s architecture breaks away from legacy designs optimized for training workloads.

The team behind VSORA has deep roots in the IP business, having spent years designing, testing, and fine-tuning their architecture. Now in its fifth generation, the architecture has been rigorously validated and benchmarked over the past two years in preparation for silicon manufacturing.

Breaking the Memory Wall

The “memory wall” has challenged chip designers since the late 1980s. Traditional architectures attempt to mitigate the impact on performance induced by data movement between external memory and processing units by layering memory hierarchies, such as multi-layer caches, scratchpads, and tightly coupled memory, each offering tradeoffs between speed and capacity.

In AI acceleration, this bottleneck becomes even more pronounced. Generative AI models, especially those based on incremental transformers, must constantly reprocess massive amounts of intermediate state data. Conventional architectures struggle here. Every cache miss—or any operation requiring access outside in-memory compute—can severely degrade performance.

VSORA tackles this head-on by collapsing the traditional memory hierarchy into a single, unified memory stage: a massive SRAM array that behaves like a flat register file. From the perspective of the processing units, any register can be accessed anywhere, at any time, within a single clock. This eliminates costly data transfers and removes the bottlenecks that hamper other designs.

A New AI Processing Paradigm: 16 Million Registers per Core

At the core of VSORA’s architecture is a high-throughput computational tile consisting of 16 processing cores. Each core integrates 64K multi-dimensional matrix multiply–accumulate (MAC) units, scalable from 2D to arbitrary N-dimensional tensor operations, alongside eight high-efficiency digital signal processing (DSP) cores. Numerical precision is dynamically configurable on a per-operation basis, ranging from 8-bit fixed-point to 32-bit floating-point formats. Both dense and sparse execution modes are supported, with runtime-selectable sparsity applied independently to weights or activations, enabling fine-grained control of computational efficiency and inference performance.

Each core incorporates an unprecedented 16 million registers, orders of magnitude more than the few hundred to few thousand typically found in conventional architectures. While such a massive register file would normally challenge traditional compiler designs, VSORA overcomes these challenges with two architectural innovations:

  1. Native Tensor Processing: VSORA’s hardware natively supports vector, tensor, and matrix operations, removing the need to decompose them into scalar instructions. This eliminates the manual implementation of nested loops often required in GPU environments such as CUDA, thereby improving computational efficiency and reducing programming complexity.
  2. High-Level Abstraction: Developers program at a high level using familiar frameworks, such as PyTorch and ONNX for AI workloads, or MATLAB-like functions for DSP, without the need to write low-level code or manage registers directly. This abstraction layer streamlines development, enhances productivity, and maximizes hardware utilization (see the sketch below).
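To illustrate the abstraction gap, the same matrix multiply can be written as a single tensor expression that a graph compiler maps onto the MAC arrays, versus the nested scalar loops a conventional instruction stream would require. This is a generic example, not VSORA's toolchain:

```python
import numpy as np

# High-level tensor expression: one operation, left to the compiler and
# hardware to schedule across the matrix-MAC units.
def matmul_tensor(a, b):
    return a @ b

# The same computation decomposed into scalar instructions: three nested
# loops, one multiply-accumulate per innermost iteration.
def matmul_scalar(a, b):
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out

a, b = np.random.rand(4, 8), np.random.rand(8, 3)
assert np.allclose(matmul_tensor(a, b), matmul_scalar(a, b))
```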

Chiplet-Based Scalability

VSORA’s physical implementation leverages a chiplet architecture, with each chiplet comprising two VSORA computational tiles. By combining VSORA chiplets with high-bandwidth memory (HBM) chiplet stacks, the architecture enables efficient scaling for both cloud and edge inference scenarios.

  • Datacenter-Grade Inference. The flagship Jotunn8 configuration pairs eight VSORA chiplets with eight HBM3e chiplets, delivering an impressive 3,200 TFLOPS of compute performance in FP8 dense mode. This configuration is optimized for large-scale inference workloads in datacenters.
  • Edge AI Configurations. For edge deployments, where memory requirements are lower, VSORA offers:
    • Tyr2: Two VSORA chiplets + one HBM chiplet = 800 TFLOPS
    • Tyr4: Four VSORA chiplets + one HBM chiplet = 1,600 TFLOPS

These configurations let designers tailor compute and memory resources efficiently to suit the constraints of edge applications.
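Working backward from the published numbers, each VSORA chiplet contributes roughly 400 TFLOPS in FP8 dense mode, which is consistent with all three configurations:

```python
# Per-chiplet throughput implied by the Jotunn8 figure (FP8 dense).
per_chiplet_tflops = 3200 / 8          # 400 TFLOPS

configs = {"Jotunn8": 8, "Tyr4": 4, "Tyr2": 2}   # VSORA chiplets per product
for name, chiplets in configs.items():
    print(name, chiplets * per_chiplet_tflops)   # 3200, 1600, 800 TFLOPS
```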

Power Efficiency as a Side Effect

The performance gains are evident, but equally remarkable are the advances in processing and power efficiency.

Extensive pre-silicon validation using leading large language models (LLMs) across multiple concurrent workloads demonstrated processing efficiencies exceeding 50%, roughly an order of magnitude higher than state-of-the-art GPU-based designs.

In terms of energy efficiency, the Jotunn8 architecture consistently delivers twice the performance-per-watt of comparable solutions. In practical terms, its power draw is limited to approximately 500 watts, compared to more than one kilowatt for many competing accelerators.

Collectively, these innovations yield multiple times higher effective performance at less than half the power consumption, translating to an overall system-level advantage of 8–10× over conventional implementations.

CUDA-Free Compilation Simplifies Algorithmic Mapping and Accelerates Deployment

One of the often-overlooked advantages of the VSORA architecture lies in its streamlined and flexible software stack. From a compilation perspective, the flow is dramatically simplified compared to traditional GPU environments like CUDA.

The process begins with a minimal configuration file of just a few lines that defines the target hardware environment. This file enables the same codebase to execute across a wide range of hardware configurations, whether that means distributing workloads across multiple cores, chiplets, full chips, boards, or even across nodes in a local or remote cloud. The only variable is execution speed; the functional behavior remains unchanged. This makes on-premises and localized cloud deployments seamless and scalable.
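VSORA has not published the file format, but conceptually it amounts to a handful of key-value pairs naming the target. The sketch below is purely hypothetical; every field name is invented for illustration:

```python
# Hypothetical target description; none of these keys are documented VSORA syntax.
target_config = {
    "device":    "jotunn8",     # or "tyr2", "tyr4", a single tile, a board, ...
    "chiplets":  8,
    "placement": "on_prem",     # local node versus remote cloud
    "nodes":     1,             # scale out across servers or racks if > 1
}
```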

A Familiar Flow, Without the Complexity

Unlike CUDA-based compilation processes, the VSORA flow is reassuringly simple, without the layers of manual tuning and complexity. Traditional GPU environments often require multiple painful optimization steps that, when successful, can deliver strong performance, but are fragile and time-consuming. VSORA simplifies this through a more automated and hardware-agnostic compilation approach.

The flow begins by ingesting standard AI inputs, such as models defined in PyTorch. These are processed by VSORA’s proprietary graph compiler, which automatically performs essential transformations such as layer reordering or slicing for optimal execution. It extracts weights and model structure and then outputs an intermediate C++ representation.

This C++ code is then fed into an LLVM-based backend, which identifies the compute-intensive portions of the code and maps them to the VSORA architecture. At this stage, the system becomes hardware-aware, assigning compute operations to the appropriate configuration—whether it’s a single VSORA tile, a Tyr4 edge device, a full Jotunn8 datacenter accelerator, a server, a rack, or even multiple racks in different locations.
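A schematic of that flow, with every function a trivial stand-in so the sketch runs end to end (none of these are real VSORA APIs), might look like this:

```python
# Stand-in stages for the described flow: graph compiler -> C++ IR -> LLVM
# backend -> hardware mapping. All logic is stubbed for illustration only.
def graph_compile(model):
    # Reorder/slice layers and extract weights and structure (stubbed).
    return {"structure": model["layers"], "weights": model["weights"]}

def emit_cpp(graph):
    return f"// C++ IR for {len(graph['structure'])} layers"

def llvm_map(cpp_ir, target):
    # Identify compute-intensive regions and assign them to the target.
    return {"ir": cpp_ir, "mapped_to": target}

model = {"layers": ["attention", "mlp"], "weights": [0.1, 0.2]}
print(llvm_map(emit_cpp(graph_compile(model)), target="Jotunn8"))
```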

Invisible Acceleration for Developers

From a developer’s point of view, the VSORA accelerator is invisible. Code is written as if it targets the main processor. During compilation, the flow identifies the code segments best suited for acceleration and transparently handles the transformation and mapping to VSORA hardware. This significantly lowers the barrier for adoption, requiring no low-level register manipulation or specialized programming knowledge.

VSORA’s instruction set is high-level and intuitive, carrying over rich capabilities from its origins in digital signal processing. The architecture supports AI-specific formats such as FP8 and FP16, as well as traditional DSP operations like FP16 arithmetic, all handled automatically on a per-layer basis. Switching between modes is instantaneous and requires no manual intervention.

Pipeline-Independent Execution and Intelligent Data Retention

A key architectural advantage is pipeline independence—the ability to dynamically insert or remove pipeline stages based on workload needs. This gives the system a unique capacity to “look ahead and behind” within a data stream, identifying which information must be retained for reuse. As a result, data traffic is minimized, and memory access patterns are optimized for maximum performance and efficiency, reaching levels unachievable in conventional AI or DSP systems.

Built-In Functional Safety

To support mission-critical applications such as autonomous driving, VSORA integrates functional safety features at the architectural level. Cores can be configured to operate in lockstep mode or in redundant configurations, enabling compliance with strict safety and reliability requirements.

Conclusion

VSORA is not retrofitting old designs for modern inference needs; instead, it’s building from the ground up. With a memory architecture that eliminates traditional bottlenecks, compute units tailored for tensor operations, and unmatched power efficiency, VSORA is setting a new standard for AI inference—whether in the cloud or at the edge.

Also Read:

The Rise, Fall, and Rebirth of In-Circuit Emulation (Part 1 of 2)

The Rise, Fall, and Rebirth of In-Circuit Emulation: Real-World Case Studies (Part 2 of 2)

Silicon Valley, à la Française


AI-Driven DRC Productivity Optimization: Revolutionizing Semiconductor Design

AI-Driven DRC Productivity Optimization: Revolutionizing Semiconductor Design
by Daniel Nenni on 10-28-2025 at 10:00 am


The semiconductor industry is undergoing a transformative shift with the integration of AI into DRC workflows, as showcased in the Siemens EDA presentation at the 2025 TSMC OIP. Titled “AI-Driven DRC Productivity Optimization,” this initiative, led by Siemens EDA’s David Abercrombie alongside AMD’s Stafford Yu and GuoQin Low, highlights a collaborative effort to enhance productivity and efficiency in chip design. The presentation outlines a comprehensive AI system that revolutionizes the entire EDA workflow, from knowledge sharing to automated fixing and debugging.

At the core of this innovation is the Siemens EDA AI System, which leverages a GenAI interface, knowledge base, and data lake to integrate AI tools across the portfolio. This system, deployable on customer hardware or cloud environments, supports a unified installation process and offers flexibility to incorporate customer data and models. Tools like the AI Docs Assistant and Calibre RVE Check Assist boost user understanding by providing instant answers and leveraging TSMC design rule data, respectively. The AI Docs Assistant, accessible via browser or integrated GUIs, uses retrieval-augmented generation to deliver relevant citations, while Calibre RVE Check Assist enhances debugging with specialized images and descriptions from TSMC.

Collaboration is a key pillar, with features like Calibre RVE Check Assist User Notes enabling in-house knowledge sharing. Designers can capture fixing suggestions and images, creating a shared knowledge base that enhances DRC-fixing flows across organizations. Meanwhile, Calibre DesignEnhancer automates the resolution of DRC violations on post-routed designs, using analysis-based modifications to insert sign-off DRC-clean interconnects and vias. This tool’s ability to handle complex rules and dependencies makes it a standalone DRC fixing solution.

Calibre Vision AI addresses the unique challenges of full-chip integration by offering AI-guided DRC analysis. It provides lightning-fast navigation through billions of errors, intelligent debug clustering, and cross-user collaboration tools like bookmarks and HTML reports. AMD’s testimonial underscores a 2X productivity boost in systematic error debugging, with Vision AI reducing OASIS database sizes and load times significantly. Signals analysis, such as identifying fill overlaps with clock cells or CM0 issues in breaker cells, accelerates root-cause identification.

This AI-driven approach, bolstered by AMD and TSMC collaborations, optimizes DRC sign-off productivity by boosting workflows, understanding, fixing, debugging, and collaboration. As the industry moves toward more complex designs, Siemens EDA’s AI system sets a new standard, promising faster cycle times and enhanced design robustness, paving the way for future innovations in semiconductor technology.

For more information contact Siemens EDA

Great presentation, absolutely.

Also Read:

Visualizing hidden parasitic effects in advanced IC design 

Protect against ESD by ensuring latch-up guard rings

Something New in Analog Test Automation


Emulator-Like Simulation Acceleration on GPUs. Innovation in Verification

Emulator-Like Simulation Acceleration on GPUs. Innovation in Verification
by Bernard Murphy on 10-28-2025 at 6:00 am


GPUs have been proposed before to accelerate logic simulation but haven’t quite met the need yet. This is a new attempt based on emulating emulator flows. Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and lecturer at Stanford, EE292A) and I continue our series on research ideas. As always, feedback welcome.

The Innovation

This month’s pick is GEM: GPU-Accelerated Emulator-Inspired RTL Simulation. The authors are from Peking University, China, and NVIDIA. The paper was presented at DAC 2025 and has no citations so far.

There have been previous attempts to accelerate logic simulation using GPU hardware, which have apparently foundered on a poor match between the heterogeneous nature of logic circuit activity and the SIMT architecture of GPUs. This paper proposes a new approach, modeled on FPGA-based emulators/prototypers and supported by a very long instruction word architecture. It claims impressive speedup over CPU-based simulation.


Paul’s view

Very interesting paper this month from NVIDIA Research and Peking University. It takes a fresh look at accelerating logic simulation on GPUs, something Cadence has invested heavily in since acquiring Rocketick in 2016. With the explosion in GPU computing for AI, customer motivation to use GPUs to accelerate simulation is even higher, and we are doubling down on our efforts in this area.

An NVIDIA GPU is a massive single-instruction-multiple-thread (SIMT) machine. To harness its power requires mapping a circuit to a large number of threads that each execute the same underlying program with minimal inter-thread communication. The key to doing this is intelligent replication and intelligent partitioning of logic cones across threads. Replication reduces inter-thread communication: rather than computing shared fan-in of multiple logic cones in one thread and passing the result to other threads, just have threads for each logic cone replicate compute for that shared fan-in. Smart partitioning ensures that thread processors are well utilized: we don’t want thread processors executing very deep logic cones to leave other thread processors idle that executed short paths.

In this paper, the authors synthesize a circuit to an AND-Inverter graph. To mitigate the problem of a few deep logic cones bottlenecking parallelization, they introduce a “boomerang” partitioner. This partitioner aims to balance the fan-in width of each partition rather than the gate count of each partition. Each partition is then mapped to a bit-packed structure that can be batch loaded from memory and executed on an NVIDIA GPU very efficiently. This bit-packed structure uses a 32-bit integer AND instruction followed by a 32-bit XOR-with-mask instruction to perform 32 AND-INVERT operations in one shot, with all thread processors executing this same simple program.

The authors benchmark their solution, GEM, on 5 different open-source designs ranging in size from 670k to 5.5M gates. Comparing GEM on an NVIDIA A100 to a “commercial” RTL logic simulator running on a single core of an Intel Xeon 6136 (Skylake), GEM runs from an average of 20x faster on the smallest design to 2.5x faster on the largest design. Impressive!

Raúl’s view

CPU-based RTL simulators are relatively slow and poorly scalable, while FPGA-based emulators are fast but expensive to set up and inflexible. The heterogeneous, irregular nature of digital circuits conflicts with GPUs’ SIMT (Single Instruction, Multiple Thread) architecture. GEM (GPU-accelerated emulator) overcomes these challenges, yielding the following results on commodity GPUs: an average of 6x speed-up over 8-threaded Verilator and 9x over a leading commercial simulator. The system is open sourced under Apache 2.0.

GEM’s main innovations are the “boomerang executor layer” which handles intra-block efficiency (how logic is executed inside a GPU thread block), and the partitioning flow which handles inter-block scalability (how a large circuit is divided into many pieces that can be simulated in parallel on thousands of GPU cores).

A common method for mapping logic circuits to a GPU is levelization: divide the circuit into “logic levels” so that gates at the same depth can be computed in parallel. But real circuits have many levels with only a few gates. In GPU kernels running in SIMT fashion, each level would trigger a global synchronization, and most GPU threads would be idle most of the time. The result is poor GPU utilization and a large synchronization overhead. Instead, in GEM each GPU thread block (representing one circuit partition) maintains a set of bits (8192 circuit states) in shared memory, and the boomerang layer executes logic across multiple levels (14 levels) in one pass. It processes these bits in a recursive, folded structure: pairs of bits A and B of the circuit state are repeatedly combined with an external constant C using bitwise logic operations:

r = (A AND B) XOR C

Conceptually, this is like “folding” the bit vector in half multiple times—each fold represents several logic levels being collapsed into one operation. It is performed in parallel across 32-bit words enabling word-level parallelism in addition to thread-level parallelism. These foldings are repeated 14 times until one resulting bit represents the output of a deep cone of logic. This “boomerang” execution pattern effectively computes the equivalent of 10-15 logic levels; because all operations happen within a thread block, synchronization is local, avoiding costly global GPU synchronizations. The boomerang shape corresponds to how logic density changes across circuit depth: many gates at shallow levels (wide part of the boomerang), few gates at deeper levels (narrow part).
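As a minimal illustration (my own sketch, not GEM's code): one 32-bit AND followed by an XOR with a constant mask evaluates 32 two-input AND/NAND gates at once, and repeatedly folding the packed state in half collapses roughly one logic level per pass:

```python
MASK32 = 0xFFFFFFFF

def and_invert(a: int, b: int, c: int) -> int:
    """r = (A AND B) XOR C on 32 packed bits; where c has a 1, the gate acts as a NAND."""
    return ((a & b) ^ c) & MASK32

def boomerang_fold(words, consts_per_level):
    """Fold the packed state in half each pass, applying one constant word per slot."""
    for consts in consts_per_level:
        half = len(words) // 2
        words = [and_invert(words[i], words[i + half], consts[i]) for i in range(half)]
    return words

# Four state words folded twice down to one: first pass pure ANDs,
# second pass inverts all four visible bits (NANDs).
state = [0b1111, 0b1011, 0b1101, 0b0111]
print(bin(boomerang_fold(state, [[0, 0], [0b1111]])[0]))  # 0b1110
```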

The partitioning flow deals with two issues: 1) inter-block dependencies (if two circuit partitions depend on each other’s outputs within a simulation cycle, global synchronization is needed at every time step, killing performance) and 2) replication cost. It builds on a known algorithm called RepCut (Replication-Aided Partitioning), which reduces these dependencies by duplicating some logic across partitions so each partition can simulate independently. RepCut was designed for tens of CPU threads, not for hundreds or thousands of GPU thread blocks; used directly, the amount of duplicated logic grows to over 200% for just 200+ partitions. Instead of cutting the entire circuit into hundreds of partitions in one go, GEM performs multi-stage RepCut: splitting large designs in stages, minimizing replication, aligning partition size to GPU architecture constraints (boomerang width), and merging intelligently to ensure efficient GPU occupancy, at the cost of one additional synchronization point between stages. This reduces replication to under 3% for a 500K-gate circuit partitioned into 216 blocks.

GEM’s innovation lies in its emulator-inspired abstraction that maps circuit logic to GPU execution through a VLIW architecture and highly local memory access. The mapping flow borrows from traditional EDA synthesis, placement, and partitioning logic. It achieves high simulation density and GPU efficiency, outperforming multi-threaded Verilator (6x), commercial CPU-based tools (9x), and previous GPU approaches (8x) across design types ranging from RISC-V CPUs to AI accelerators. Keep in mind that this is 2-state, bit-level simulation without the bells and whistles of a commercial simulator.

GEM combines EDA methods with GPU computing in a software-only, open-source package compatible with standard GPUs, which makes it appealing. While described as RTL level, it operates at the bit level like FPGA emulators. GEM currently lacks multi-GPU support, 4-state logic, arithmetic modeling, and event-driven pruning, requiring further development to potentially become a competitive simulation alternative.

Also Read:

Cadence’s Strategic Leap: Acquiring Hexagon’s Design & Engineering Business

Cocotb for Verification. Innovation in Verification

A Big Step Forward to Limit AI Power Demand


CEO Interview with Wilfred Gomes of Mueon Corporation

CEO Interview with Wilfred Gomes of Mueon Corporation
by Daniel Nenni on 10-27-2025 at 2:00 pm


Wilfred Gomes is the co-founder, CEO, and president of Mueon Corporation, a next-generation infrastructure startup rethinking how data centers are built for the AI era. The company’s flagship innovation, Cubelets™ (modular, stackable units that unite compute, memory, power delivery, and thermal management), replaces the traditional rack-based model that has dominated data centers since its inception, enabling up to 10x gains in density, energy efficiency, and deployment speed.

He served as a Fellow in Microprocessor Design and Technologies at Intel for nearly three decades, where he drove innovations in advanced packaging, EDA for data center, AI, and client platforms, with a focus on logic, memory and implementation efficiency. He was a co-inventor of Intel’s Foveros 3D integration platform and was instrumental in bringing 3D stacking into mainstream production. With over 250 patents across high-performance CPU and GPU design, Wilfred has played a pivotal role in charting the path toward the next era of AI-scale workloads.

Tell us about your company?

Mueon is redefining how modern data centers are built for the AI era. Founded with the belief that the core infrastructure of computing requires a fundamental change, Mueon is creating a new architectural foundation that unites compute, memory, power delivery, and thermal management into a single, modular system. Our flagship innovation, Cubelets™, replaces the traditional rack-based model that has dominated data centers since its inception, removing the current limits that constrain how silicon systems are built and scaled. Mueon is focused on making data centers more efficient, reliable, powerful, and significantly scalable, while reducing their carbon footprint.

What problems are you solving? 

The AI era has pushed data centers to their physical and architectural limits. Power, cost, and complexity are now critical bottlenecks holding AI innovation back. Traditional rack-based systems draw enormous amounts of electricity, generate heat that’s increasingly difficult to manage, and are expensive and slow to scale. Even with massive investment, operators are running into physical limits on how much performance they can extract from traditional architectures.

A major part of the problem lies in the memory hierarchy itself. Today’s compute systems treat memory as a fragmented stack, forcing data to constantly move between layers and adding latency, inefficiencies, and extra costs. Everyone wants memory that’s 10-100 times larger and faster, but no technology exists today to make that possible.

At Mueon, we’re rethinking this from the ground up. Our Cubelet architecture integrates compute, memory, power delivery, and thermal management into a single modular unit designed to bring data and compute closer together – creating high bandwidth and low latency domains. In practice, that means memory that behaves as a single, instantly accessible pool, a capability no traditional system can match. The result is a new class of data infrastructure that eliminates the tradeoffs between power, cost, and performance, and sets the foundation for computing systems with no architectural ceiling.

What application areas are your strongest?

Mueon’s strongest application areas center on three tightly linked domains: memory, power delivery, and thermal management. These are not independent domains; they have to be addressed together. At scale, each one affects the others, and true performance gains come through co-optimization. Mueon’s Cubelet architecture was built precisely for that intersection. By integrating and co-designing these three elements within a single system, Cubelets achieve breakthroughs that aren’t possible in legacy architectures, giving us a clear edge in scaling AI systems. Memory performance improves because power and cooling are managed at the chip and module level. Power efficiency increases because heat is dissipated intelligently. Thermal balance is maintained because compute, memory, and power delivery are treated as a unified whole.

This requires a fundamental change in architecture, one that brings the right technologies together in the right way. Our success comes from applying this co-optimization framework across every layer of the stack, enabling higher density, efficiency, and scalability, while remaining fully compatible with existing AI and cloud software environments.

What keeps your customers up at night?

Our customers are grappling with the realities of scaling infrastructure in the AI era, and their concerns can be summarized into three key areas: power, cost, and complexity.

Power – AI workloads are expanding so quickly that operators are running into challenges with electricity and cooling capacity. Without new approaches, operators risk hitting unscalable ceilings. Even with major investment, the physical and environmental limits of current data centers make it increasingly difficult to scale.

Cost – Building large-scale systems requires billions in capital for servers, power, cooling, networking, and real estate. Customers fear that these investments may not keep pace with AI’s rapidly changing demands, leaving them with stranded assets that are costly to operate and unable to deliver the performance they need.

Complexity – Coordinating compute, memory, power, and thermal management across tens of thousands of racks is a daunting engineering challenge. This slows development cycles, increases operational risk, and leaves customers feeling that AI innovation is outpacing their ability to adapt.

Mueon removes these limits; our Cubelet architecture integrates compute, memory, power, and thermal management into a single modular system, lowering costs, simplifying operations, and enabling AI infrastructure to scale efficiently.

What does the competitive landscape look like and how do you differentiate?

Much of the industry is still focused on extracting marginal gains from the same rack-based paradigm that has defined data centers for decades. Traditional OEMs and hyper-scalers push for incremental improvements in chip performance, cooling, or energy use, but they’re constrained by the physical and architectural limits of the rack. Some companies explore new cooling methods or form factors, but most solve for one layer of the problem rather than redefining the full system.

At Mueon, we see this as a moment to move beyond incremental improvements to redefine how entire systems are built. Moore’s Law has driven extraordinary advances at the silicon level, but it hasn’t been matched by equivalent innovation in how systems are built and scaled. We believe the next leap in computing performance will come from rethinking those abstractions – from chip to system to data center.

Our Cubelets architecture embodies that leap. By integrating compute, memory, power delivery, and thermal management into a single modular unit, we eliminate the artificial boundaries that slow innovation. This architecture delivers order-of-magnitude gains in density, deployment speed, and energy efficiency, while remaining compatible with today’s AI and cloud software stacks. Mueon is leading the next wave of abstraction, one of the first to deliver a fundamentally new model for building data centers in the AI age.

What new features/technology are you working on?

Our goal is to remove the physical limits that have constrained how silicon systems are built and scaled; there should be no limit to how large or how complex a chip can be. We’re developing technology that allows silicon to be built, stacked, and interconnected at unprecedented scale – whether that means going smaller or larger.

Scaling Down – Pushing toward the smallest possible dimensions for compute, memory, and interconnects to maximize efficiency and density.

Scaling Up – Enabling arbitrarily large chips and multi-layered stacking architectures that operate as a single coherent system.

Together, these dimensions unlock new possibilities for performance, efficiency, and system design. By breaking free from traditional limits on packaging and processing, Mueon is creating a foundation where compute can expand organically – without the bottlenecks that defined the last generation of data center architecture. We’re not just enabling faster chips, we’re creating the foundation for entirely new classes of computing.

How do customers normally engage with your company?

Right now, most of our engagements are with leading AI companies and hyperscalers that are actively building the next generation of AI chips and data infrastructure. These organizations are deeply aligned with our mission to remove the limits on how silicon can be designed, scaled, and deployed. We’re working closely with them to co-develop systems that push the boundaries of performance and efficiency.

We welcome conversations with anyone tackling similar challenges or exploring new chip models. Collaboration is core to how we operate, whether you’re building advanced AI systems, experimenting with chip architectures, or simply have ideas about where silicon design should go next.

Also Read:

CEO Interview with Alex Demkov of La Luce Cristallina

CEO Interview with Dr. Bernie Malouin Founder of JetCool and VP of Flex Liquid Cooling

CEO Interview with Gary Spittle of Sonical


Pioneering Edge AI: TekStart’s Cognitum Processor Ushers in a New Era of Efficient Intelligence

Pioneering Edge AI: TekStart’s Cognitum Processor Ushers in a New Era of Efficient Intelligence
by Daniel Nenni on 10-27-2025 at 10:00 am


One of the more interesting companies I met at the AI Infra Summit was a company known to me for some time. The most interesting part was the chip they are in the process of taping out: a high-performance, ultra-low-power AI processor purpose-built for edge computing. It is claimed to deliver “the processing muscle to run advanced AI inference directly where data is generated without the energy burden that typically comes with it.”

In an era where artificial intelligence is no longer confined to data centers but embedded in the fabric of everyday devices, power efficiency and real-time performance stand as the ultimate battlegrounds. TekStart Group, a trailblazing venture studio founded in 1998, has long championed the transformation of raw technological innovation into market-dominating realities. With a track record of developing, funding, and exiting over 120 companies, TekStart’s ChipStart® division specializes in custom ASIC design and outsourced operations, bridging the gap between concept and commercialization. Headquartered in Burlington, Ontario, with global R&D hubs in Irvine, California, and Kyiv, Ukraine—plus offices across the UK, France, Japan, and the US—the company operates as a nexus for innovators, turning breakthroughs into scalable successes.

At the forefront of this mission is Cognitum, ChipStart’s groundbreaking Edge AI processor unveiled in September 2025. Designed explicitly for the exigencies of edge computing, Cognitum delivers an industry-leading 65 TOPS (tera operations per second) peak performance while consuming under 2 watts—redefining the performance-to-power ratio. In a market overview matrix, Cognitum eclipses competitors like BrainChip’s Akida, Hailo processors, and even Qualcomm’s Snapdragon in the ultra-low-power quadrant, positioning it as the “sweet spot” for high-performance edge AI without the thermal and efficiency pitfalls of legacy solutions. This isn’t incremental improvement; it’s a paradigm shift, addressing core Edge AI challenges: the demand for ultra-low-power, on-device learning; real-time inference for latency-sensitive applications; and the scaling of deep neural networks (DNNs) that once required cloud dependency.

Cognitum’s architecture is a masterclass in versatility: one unified design that scales seamlessly across edge devices and nodes, infusing remote intelligence into diverse ecosystems. For security applications, its sub-2W footprint enables always-on monitoring in surveillance cameras, efficiently streaming inferences for continuous threat detection without power drain. Lower power means cooler operation, extending device lifespan, while flexible multi-model switching allows real-time adaptation between facial recognition, object detection, or event analysis, making it ideal for retail, public safety, and smart city platforms.

In agriculture (AgTech), Cognitum empowers solar- or battery-operated field devices, supporting energy-efficient intelligence for crop surveillance drones and livestock monitoring. Farmers can fine-tune models to local conditions like soil types and weather patterns, reducing maintenance needs and runtime servicing. A single processor sustains adaptive AI in harsh environments, slashing operational costs and boosting yields through proactive insights.

Wearables and AR/VR headsets benefit from Cognitum’s modular, embeddable silicon, optimized for vision and biometric workloads. Its all-day performance extends battery life in fitness trackers and immersive devices, while enabling gesture, vital-sign, and motion AI that transforms wearables from passive gadgets into proactive companions.

Industrial automation sees Cognitum as a retrofit powerhouse, integrating legacy and next-gen systems for unified shop-floor management. Predictive analytics fused with vision AI enable real-time decision-making, improving uptime in Industry 4.0 setups. Durable, efficient nodes keep robotic arms and IoT sensors running longer, minimizing downtime in predictive maintenance.

Looking ahead, Cognitum is primed to power the future of agentic AI—systems that are proactive, autonomous, targeted, and collaborative via the PACT framework. As rUv, a leading AI innovator, proclaimed in August 2025:

“Building a neural network with cost-effective, ultra-low-power Edge AI inferencing nodes is a game-changer!”

With highly scalable, cost-effective connectivity and ultra-low-power nodes, Cognitum supports expansive AI clusters without the energy overhead of traditional copper interconnects.

TekStart’s ethos of empowering innovators through ChipStart’s accelerated silicon success shines in Cognitum. It redefines Edge AI as accessible, efficient, and boundless, fostering new possibilities in surveillance, AgTech, wearables, and industrial realms. As Gordon Benzie, VP of Marketing, invites: Reach out at gordon@tekstart.com to explore how Cognitum can electrify your next breakthrough.

Bottom line: In a world racing toward intelligent edges, TekStart isn’t just building processors—it’s architecting the intelligent tomorrow.

For more information contact TekStart.

Also Read:

CEO Interview with Howard Pakosh of TekStart

Scaling Debug Wisdom with Bronco AI

Arm Lumex Pushes Further into Standalone GenAI on Mobile


CAST Simplifies RISC-V Embedded Processor IP Adoption with New Catalyst Program

CAST Simplifies RISC-V Embedded Processor IP Adoption with New Catalyst Program
by Daniel Nenni on 10-26-2025 at 10:00 am


In a move poised to accelerate the integration of open-source processor architectures into resource-constrained devices, semiconductor IP provider CAST, Inc. unveiled its Catalyst™ Program at the RISC-V Summit in Santa Clara, California. This initiative addresses a persistent pain point for embedded system developers: the overwhelming configurability of RISC-V processors. By offering pre-tuned, ready-to-deploy IP cores alongside flexible licensing and expert support, CAST aims to strip away the complexity, enabling faster prototyping and deployment in low-power, cost-sensitive applications like IoT sensors, wearables, and industrial controllers.

RISC-V, the royalty-free instruction set architecture, has exploded in popularity since its inception in 2010, promising customization and vendor neutrality in a market dominated by proprietary cores from Arm and others. Yet its flexibility, with hundreds of extensions and options, can paralyze small teams lacking deep expertise. CAST, a veteran IP firm founded in 1993 with a portfolio spanning processors, security modules, and interfaces, leverages its more than 30 years of experience to bridge this gap. Sourcing cores from partners like Beyond Semiconductor and Fraunhofer IPMS, the Catalyst Program delivers silicon-proven 32-bit RISC-V solutions optimized for embedded realities, not exhaustive versatility.

At the heart of Catalyst are five meticulously preconfigured processor cores, each tailored to distinct embedded use cases. These eliminate the “configuration overload” that Bill Finch, CAST’s senior business development vice president, describes as unnecessary for 90% of developers.

The lineup spans from ultra-minimalist to feature-rich designs:
  • BA5x-TN: A tiny MCU-class core (~17k gates) for basic control tasks, such as sensor hubs or always-on logic, serving as a seamless 8/16-bit replacement with robust performance in minimal silicon area.
  • BA5x-LP: A low-power workhorse with configurable caches, interrupts, and bus interfaces, ideal for general-purpose industrial systems demanding balanced efficiency.
  • BA5x-CM: RTOS-ready with hardware floating-point and atomic operations, targeting connected IoT and wearables where multitasking meets real-time constraints.
  • BA5x-EP: An edge powerhouse featuring dual/single-precision FPU, MMU, and optional hypervisor for secure, multi-OS environments in gateways or complex controllers.
  • EMSA5-GP: Vector-enhanced for parallel workloads like FFTs, sensor fusion, and packet processing, adding RISC-V Vector extension subsets without bloating area.

These cores promise predictable power, performance, and area metrics, verified through CAST’s rigorous validation. Developers can integrate them into ASICs or FPGAs immediately, bypassing weeks of tweaking.

What truly distinguishes Catalyst is its risk-mitigating ecosystem. Licensing flips the traditional model: no upfront fees for full-featured evaluation kits, allowing simulation, benchmarking, and design integration sans commitment. Only upon production commitment does the fee apply, with transparent terms, no royalties, and optional extended support. Direct access to CAST’s senior engineers during evaluation, guiding customization and tuning, ensures smooth onboarding. As Evan Price, CAST’s RISC-V product manager, puts it: “Catalyst reduces complications to let embedded system teams focus on innovation… without surprises.”

This program arrives at a pivotal moment. The embedded market, projected to exceed $150 billion by 2030, demands RISC-V’s cost advantages amid supply chain volatility and geopolitical tensions over proprietary IP. Yet adoption lags in ultra-low-power segments due to integration hurdles. Catalyst democratizes access, potentially slashing time-to-market by months and lowering barriers for startups and mid-tier firms. Early adopters could see 20-30% area reductions versus over-configured alternatives, per CAST benchmarks, while inheriting battle-tested reliability from hundreds of global deployments.

Bottom line:  CAST’s Catalyst Program isn’t mere IP; it’s a launchpad for RISC-V’s embedded renaissance. By fusing proven tech with developer-friendly economics, it could catalyze widespread adoption, fostering innovation in edge AI, smart cities, and beyond. As the industry pivots to open architectures, programs like this ensure no team is left configuring in the dark. For details, visit CAST’s RISC-V page or reach out to their sales team—your next embedded breakthrough awaits.

Contact CAST Here

Also Read:

RANiX Employs CAST’s TSN IP Core in Revolutionary Automotive Antenna System

CAST Webinar About Supercharging Your Systems with Lossless Data Compression IPs

Podcast EP273: An Overview of the RISC-V Market and CAST’s unique Abilities to Grow the Market with Evan Price