Unlocking the cloud: A new era for post-tapeout flow for semiconductor manufacturing
by Bassem Riad on 03-04-2025 at 6:00 am

As semiconductor chips shrink and design complexity skyrockets, managing post-tapeout flow (PTOF) jobs has become one of the most compute-intensive tasks in manufacturing. Advanced computational lithography demands an enormous amount of computing power, putting traditional in-house resources to the test. Enter the cloud—an agile, scalable solution with hundreds of compute options, set to revolutionize how foundries manage PTOF workloads.

The unpredictability problem: Bridging the gap in resources

For years, foundries have relied on powerful in-house resources to handle PTOF tasks. But PTOF workloads aren’t consistent—sometimes demand surges, leading to waiting queues that delay production, while at other times, costly resources sit idle. Expanding on-premises infrastructure to match peak demand is both costly and slow, often taking months to deploy. In an industry where every day counts, finding a flexible solution is essential. This is where the cloud steps in, offering dynamic scaling and the freedom to match resources with demand as needed.

Cloud elasticity: Pay only for what you need

Cloud platforms are transforming PTOF workflows by allowing foundries to pay only for what they use. This on-demand scaling means foundries no longer need to overprovision or commit to massive hardware investments upfront. With infrastructure managed by cloud providers, teams can shift their focus to developing applications and improving customer engagement while resources expand or contract as needed. Cloud services give semiconductor companies access to a global network of tools, empowering them to adapt quickly and push the boundaries of innovation.

Scaling up seamlessly: Siemens EDA and AWS join forces

This vision of agility and scalability became a reality in July 2023, when Siemens EDA and AWS signed a Strategic Collaboration Agreement to accelerate EDA workloads in the cloud. Out of this partnership came Cloud Flight Plans—automation scripts and best practices that streamline EDA deployment on AWS. Now, semiconductor manufacturers can effortlessly scale up resources, deploying hundreds of thousands of cores on demand. No more waiting months to expand data centers; cloud resources are available instantly, without capital investments or maintenance.

Building the foundation: A reference architecture for PTOF in the cloud

This agility is enhanced by Siemens EDA’s Cloud Reference Environment, an architecture purpose-built to handle PTOF jobs on AWS. Designed with secure principles and optimized for seamless workload management, this setup dynamically scales resources based on current demand. A central management system allocates resources to high-priority jobs and quickly redirects any underutilized capacity. Real-time spending insights empower semiconductor companies to control their cloud costs, ensuring resources are optimized at every step and that budget surprises are a thing of the past.

Real-time cost control with CalCM+: Smart scaling for smarter budgets

But it’s not just about scaling—it’s also about managing those costs smartly. Enter CalCM+, a solution for maximizing the cloud efficiency of Calibre PTOF jobs. Central to CalCM+ is adaptive resource management, which monitors active jobs and allocates resources based on actual demand. This intelligent scaling ensures resources aren’t wasted on overprovisioning, keeping budgets lean.

At the heart of CalCM+ is the cost calculation app, offering real-time spending insights by integrating directly with AWS pricing and the Slurm scheduler. Teams can track job costs in real time, make informed decisions, and optimize resources based on precise needs. A recent study (see chart below) highlights how CalCM+ delivers measurable cost savings through smart scaling and predictive insights, proving that cloud efficiency is as much about cost control as it is about performance.
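
To make the idea concrete, here is a minimal sketch of the kind of calculation such a cost app performs, assuming a Slurm-managed cluster and a placeholder per-core-hour price. It is illustrative only and is not the CalCM+ implementation.

```python
import subprocess

# Conceptual sketch only, not the CalCM+ implementation: combine Slurm
# accounting data with an assumed per-core-hour price to estimate what a
# running or finished job has cost so far.
PRICE_PER_CORE_HOUR = 0.04   # placeholder rate, not a quoted AWS price

def estimate_job_cost(job_id: str) -> float:
    # sacct -X restricts output to the job allocation; -n -P give a bare,
    # pipe-delimited record of elapsed seconds and allocated CPUs.
    out = subprocess.run(
        ["sacct", "-j", job_id, "-X", "-n", "-P",
         "--format=ElapsedRaw,AllocCPUS"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    elapsed_s, cpus = (int(v) for v in out.split("|"))
    return elapsed_s / 3600 * cpus * PRICE_PER_CORE_HOUR

print(f"estimated spend so far: ${estimate_job_cost('123456'):.2f}")
```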

Data-driven insights: Predicting the future of resource use

CalCM+ goes a step further with a data analysis module that records usage metrics and job metadata, enabling predictions for future jobs. By studying historical data, this tool provides insights into expected runtime and memory usage, allowing teams to pick the best instance types for each task.
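
As a rough illustration of that idea (not the actual CalCM+ module), a few lines of Python can fit a regression model on historical job metadata and predict the runtime of a new job; the file name and column names below are assumptions.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative sketch: learn expected runtime from past job metadata so a
# new job can be matched to a suitable instance type. The CSV file and
# column names are hypothetical, not part of the CalCM+ product.
hist = pd.read_csv("ptof_job_history.csv")
features = ["layer_count", "rule_deck_size_mb", "cpu_count"]

model = GradientBoostingRegressor().fit(hist[features], hist["runtime_min"])

new_job = pd.DataFrame(
    [{"layer_count": 14, "rule_deck_size_mb": 850, "cpu_count": 512}])
print(f"predicted runtime: {model.predict(new_job)[0]:.0f} minutes")
```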

Lean Computing 

The AUTOREVOKECYCLE feature dynamically releases underutilized CPUs and reallocates them to high-demand jobs. This lean computing approach doesn’t just keep costs down—it ensures resources are used precisely where they’re needed, avoiding the waste that comes from overprovisioning. Figure 1 shows the effect of using the AUTOREVOKECYCLE feature.

Figure 1. The AUTOREVOKECYCLE feature dynamically releases underutilized CPUs and reallocates them to high-demand jobs.
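
The snippet below is a toy simulation of the general revoke-and-reallocate idea, written only to make the concept tangible. How Calibre implements AUTOREVOKECYCLE is not detailed here, so the threshold and worker sizes are arbitrary assumptions.

```python
import random
from dataclasses import dataclass

# Toy model of lean computing: workers whose utilization drops below a
# threshold give their cores back so queued jobs can use them.
@dataclass
class Worker:
    cores: int
    utilization: float  # fraction of cores doing useful work

pool = [Worker(cores=64, utilization=random.uniform(0.05, 1.0)) for _ in range(8)]
queued_jobs = 3
freed_cores = 0

for w in pool:
    if w.utilization < 0.20:    # assumed idleness threshold
        freed_cores += w.cores  # revoke this worker's cores
        w.cores = 0

print(f"revoked {freed_cores} cores; reallocating them to {queued_jobs} queued jobs")
```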

Cost savings through the power of spot instances

Adding to the cost-saving toolkit is the cloud’s ability to offer dynamic pricing. Foundries can now use spot instances to run high-performance tasks at a fraction of the regular cost. These spot instances, ideal for peak demand, tap into unused cloud capacity at lower rates, helping companies stay within budget without compromising performance.
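
For readers unfamiliar with how spot capacity is requested, the following is a minimal sketch using the standard boto3 EC2 API; the AMI, instance type, and region are placeholders, not values recommended by the article.

```python
import boto3

# Hedged sketch: request a Spot-priced instance through the standard EC2 API.
ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",     # placeholder AMI
    InstanceType="c6i.32xlarge",         # placeholder compute-heavy type
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print("launched:", resp["Instances"][0]["InstanceId"])
```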

FullScale processing: Speeding up time-to-tapeout

Cloud elasticity also shines with Calibre FullScale’s high-throughput processing capabilities, a compelling answer to the compute-intensive demands of PTOF. By enabling parallel lithography simulations, Calibre FullScale slashes job completion times, making faster tapeouts more attainable than ever. With the flexibility to adjust resources based on cost and performance needs, FullScale delivers optimal efficiency, ensuring every task is completed on schedule and with maximum precision (figure 2).

Figure 2. Calibre FullScale speeds time to tapeout.

Tapping into GPU power: Acceleration for compute-intensive tasks

For leading-edge technology nodes, the availability of GPU instances in the cloud is a game-changer. Compute-intensive tasks—like lithography, etch, and e-beam simulations—now run with hardware-accelerated performance, reducing runtimes dramatically. With GPU acceleration, manufacturers can conduct highly detailed simulations that were previously limited by on-premises constraints. The cloud’s GPU capabilities bring precision and scale, redefining what’s possible in PTOF simulations.

Cloud-native orchestration: The Kubernetes advantage

Orchestration systems like Kubernetes are also part of this cloud-driven transformation. Siemens EDA’s solutions leverage container orchestration to enable seamless job distribution across cloud resources. With Kubernetes automating deployment, scaling, and workload management, running complex Calibre PTOF jobs becomes effortless, whether on-premises or in the cloud. This cloud-native execution model maximizes resource use, delivering scalability, efficiency, and flexibility for semiconductor manufacturers.
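
As a flavor of what such orchestration looks like in practice, here is a minimal sketch that submits a containerized batch job through the standard Kubernetes Python client. The image name, command, and resource requests are placeholders and do not describe how Siemens EDA packages Calibre.

```python
from kubernetes import client, config

# Hedged sketch: submit one containerized batch job to a cluster.
config.load_kube_config()   # assumes a configured kubeconfig

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="ptof-batch-001"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="worker",
                    image="registry.example.com/ptof-worker:latest",  # placeholder
                    command=["run_ptof_job.sh"],                      # placeholder
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "32", "memory": "128Gi"}),
                )],
            )
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```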

A new era for semiconductor manufacturing

As semiconductor manufacturing embraces the cloud, a new era is taking shape—one where agility, efficiency, and cost control redefine the way PTOF tasks are managed. With the flexibility to scale on demand, optimize budgets, and orchestrate workloads seamlessly, cloud-based PTOF workflows are setting new standards. By tapping into cloud capabilities, container orchestration, and GPU resources, semiconductor manufacturers gain the edge needed to drive innovation, speed time-to-market, and thrive in an ever-evolving industry.

For a deep dive into this PTOF cloud flow, please see the technical paper, Crush Semi-manufacturing runtimes with Calibre in the cloud.

Bassem is a cloud product engineer specializing in scalable and cost-efficient computing solutions for semiconductor design and manufacturing. With expertise in Kubernetes, high-performance computing, and cloud infrastructure, Bassem focuses on optimizing post-tapeout workflows, EDA tool deployment, and hybrid cloud strategies.

Also Read:

Getting Faster DRC Results with a New Approach

Full Spectrum Transient Noise: A must have sign-off analysis for silicon success

PSS and UVM Work Together for System-Level Verification

Averting Hacks of PCIe® Transport using CMA/SPDM and Advanced Cryptographic Techniques


SemiWiki Outlook 2025 with yieldHUB Founder & CEO John O’Donnell
by Daniel Nenni on 03-03-2025 at 10:00 am

What was the most exciting high point of 2024 for your company?

One of the most exciting milestones in 2024 was the further expansion of our data science team, which allowed us to take a bold step toward fully integrating AI into our solutions. This is not only enhancing our offerings but also helping us grow within our existing customer base.

Another highlight for yieldHUB was attracting new and strategic customers, for example those developing AI chips and others involved in onshoring testing in the USA and Europe.

What was the biggest challenge your company faced in 2024?

The biggest challenge in 2024 was how to keep developing yieldHUB’s next-generation platform while meeting the increasing demand for our current platform as we added new customers.

How is your company’s work addressing this biggest challenge?

We expanded our R&D and customer success teams to accelerate the new platform’s progress while ensuring that our customers continued to receive top-tier support and service. Maintaining strong customer relationships and responsiveness remains a top priority.

What do you think the biggest growth area for 2025 will be, and why?

We have a new product coming out soon called yieldHUB Live, our AI-driven, tester-agnostic, real-time monitoring system for test and probe. It speeds up testing by recommending to the operator what to do when there are issues. It also allows in-depth remote monitoring of the test/probe floor and tracks key parameters that reflect the integrity, or lack thereof, of testing and trimming. The demand for real-time insights is increasing, and we believe yieldHUB Live will be a game-changer for test houses: lots will spend far less time on hold, and fewer testers will need to be purchased when volumes increase again.

How is your company’s work addressing this growth?

We’ve worked hard to ensure that yieldHUB Live, although complex behind the scenes, is simple to implement on any tester type, while also being scalable and exceptionally reliable. So once set up, it can fan out to hundreds of testers within days, as it requires no additional hardware.

What conferences did you attend in 2024 and how was the traffic?

We participated in several key industry events in 2024, including ITC Test Week, Semicon West, International Microwave Symposium, PSECE, the Annual NMI Conference, IEEE VLSI Test Symposium, and the Semiconductor Wafer Test Expo. Attendance was strong across all these events, and we had great engagement with both existing and potential customers.

Will you attend conferences in 2025? Same or more?

Absolutely! We’ve already confirmed that we’ll be exhibiting at the NMI Annual Conference (UK), Semicon West, ITC, and PSECE (Philippines), with plans to attend additional events throughout the year. We recently became a member of Silicon Saxony, so the plan is to expand our presence in Germany and the EU.

How do customers engage with your company?

We like to make sure that all yieldHUB customers receive exceptional support and value at every stage. Our dedicated Customer Success team is committed to providing proactive, personalized assistance, and our exclusive library of tools and resources empowers customers to maximize the benefits of our solutions.

New customers receive comprehensive online training, and all customers have access to our highly efficient ticketing system, ensuring that any inquiries or issues are addressed swiftly. In fact, our median first response time in 2024 was just 5 minutes, meaning customers hear from one of our engineers almost instantly.

https://www.yieldhub.com/request-a-demo/

Beyond reactive support, we prioritize ongoing engagement. Our Director of Customer Success, Michael Clarke, regularly connects with customers via face-to-face video calls to ensure they are fully supported and to gain valuable feedback.

The results speak for themselves: Our customer satisfaction rating for closed tickets in 2024 was an impressive 95%, far exceeding the global benchmark of 74%. This level of responsiveness and care is another area that sets yieldHUB apart and we’re committed to continuing this high standard in 2025 and beyond.

Additional questions or final comments?

We’re excited for what’s to come in the next two years. Our focus remains on delivering cutting-edge AI-driven data analytics that empower semiconductor companies, especially at the test stage, to improve efficiency and maximize profitability. We look forward to continuing our journey with customers, partners and the industry as a whole!

Talk to a yield expert
Also Read:

yieldHUB Improves Semiconductor Product Quality for All

Podcast EP167: What is Dirty Data and How YieldHUB Helps Fix It With Carl Moore

Podcast EP181: A Tour of yieldHUB’s Operation and Impact with Carl Moore

Podcast EP243: What is Yield Management and Why it is Important for Success with Kevin Robinson

Podcast EP254: How Genealogy Correlation Can Uncover New Design Insights and improvements with yieldHUB’s Kevin Robinson


Powering the Future: How Engineered Substrates and Material Innovation Drive the Semiconductor Revolution
by Kalar Rajendiran on 03-03-2025 at 6:00 am

Engineered substrate technology is driving an evolution within the semiconductor industry. As Moore’s Law reaches its limits, the focus is shifting from traditional planar wafer scaling to innovative material engineering and 3D integration. Companies like Soitec, Intel and Samsung are pioneering this transition, unlocking new levels of performance, efficiency, and scalability.

The topic of engineered substrates and material innovation was the focus of an interesting panel discussion at the Substrate Vision Summit 2025. Daniel Nenni, Founder of SemiWiki.com, moderated the session. SemiWiki.com is a popular online platform featuring an active discussion forum dedicated to semiconductors. Christophe Maleville, CTO & SEVP of Innovation at Soitec, David Thompson, VP Technology Research at Intel, and Kelvin Low, VP Market Intelligence & Business Development at Samsung Foundry, were the panelists.

Engineered Substrates: Changing the Competitive Landscape

One of the most compelling advantages of engineered substrates is the ability to preinstall critical performance elements into the wafer itself. By embedding functionality at the substrate level, chip designers can achieve significant improvements in efficiency and power savings.

A clear example of this was shown several years ago with RF-SOI wafers, where Soitec demonstrated that a 2G design achieved 3G-level performance simply by switching to an RF-SOI wafer. This breakthrough delivered GaAs-like performance without using GaAs technology, proving the potential of engineered wafers across various applications. Such advancements not only enhance performance but also accelerate product development cycles and reduce design complexity.

Addressing Challenges of Engineered Wafers

Semiconductor manufacturers face two major cost components: the cost of processing the wafer (internally or through procurement) and the cost of time (technology development cycles, learning curves, and integration challenges).

If every manufacturer were to independently develop SOI wafer technology, it would be an inefficient process with a steep learning curve. Instead, by relying on specialized providers like Soitec, foundries and chipmakers can source mature, high-performance engineered substrates and focus on differentiation at the chip level. This ecosystem-driven approach accelerates technology readiness and product development while ensuring cost efficiency.

Foundry Adoption and Market Demand

Foundries are recognizing the strategic importance of engineered substrates, particularly for Fully Depleted SOI (FD-SOI) technology. Samsung Foundry, a key player in this space, has already adopted 28FD-SOI in high-volume production at its Austin, TX fab, with customers like NXP and Lattice leveraging its benefits. Furthermore, Samsung is expanding its FD-SOI capacity to meet rising demand, while GlobalFoundries has also joined the ecosystem, reinforcing the technology’s viability. 18FD-SOI is on Samsung Foundry’s roadmap with STMicroelectronics as the lead customer.

Despite early concerns about cost and supply chain stability, FD-SOI has proven to be a compelling solution for applications that can manipulate body-biasing to achieve low power and high efficiency. Soitec has further addressed adoption challenges by investing in design infrastructure—including the acquisition of Dolphin Integration—to enhance support for SOI-based designs.

The 3D Future of Engineered Wafers

Both Soitec and Intel are embracing the 3D way of building engineered wafers. Soitec is advancing Smart Cut™ technology to enable precise layer transfer, facilitating hybrid bonding and wafer stacking for 3D integration. Intel, on the other hand, is developing Foveros 3D stacking, which enables transistors and logic units to be vertically integrated for improved performance and energy efficiency.

Unlike the traditional planar approach, where transistors are arranged side by side, the 3D method stacks layers vertically, reducing interconnect distances and power consumption. This shift is critical for sustaining Moore’s Law and ensuring future generations of semiconductors meet the growing demands of AI, high-performance computing, and edge applications.

Standardization and Scalability: Key to Mass Adoption

The conversation around wafer size standardization is evolving, but the real challenge lies in standardizing die-to-die interconnects for chiplet-based designs. UCIe (Universal Chiplet Interconnect Express) is leading this initiative, enabling interoperability across different foundries and manufacturers.

From an economic standpoint, scaling wafer size can yield more dies per wafer. For engineered materials like SiC or GaN, the cost-benefit analysis varies. A 300mm GaN substrate, for example, can achieve 20X Figure of Merit improvement over a 200mm GaN wafer, demonstrating the potential for engineered substrates to revolutionize power electronics and RF applications.

Value Creation Beyond Die Cost

Ultimately, the value of engineered substrates extends beyond raw die cost. By enhancing performance, reducing power consumption, and enabling new system architectures, these wafers deliver system-wide cost savings and new application possibilities. Without this broader perspective, certain technologies—such as SiC for power electronics—would struggle to establish a strong business case based solely on die cost.

Summary

As the semiconductor industry moves toward a 3D future, engineered substrates are becoming a strategic enabler of next-generation computing. Preinstalling critical performance elements into the wafer itself is helping redefine what’s possible in chip design. Foundries are embracing FD-SOI, and the push for larger, high-performance wafers is opening the door for more efficient, scalable, and cost-effective semiconductor manufacturing.

With increasing demand for AI, 5G, automotive, and high-performance computing, engineered substrates will be at the heart of the semiconductor industry’s next wave of innovation. The companies that leverage this technology early will be the ones shaping the future of computing.

Also Read:

Soitec: Materializing Future Innovations in Semiconductors

I will see you at the Substrate Vision Summit in Santa Clara

EVs, Silicon Carbide & Soitec’s SmartSiC™: The High-Tech Spark Driving the Future (with a Twist!)


Podcast EP276: How Alphawave Semi is Fueling the Next Generation of AI Systems with Letizia Giuliano
by Daniel Nenni on 02-28-2025 at 10:00 am

Dan is joined by Letizia Giuliano, Vice President of Product Marketing and Management at Alphawave Semi. She specializes in architecting cutting-edge solutions for high-speed connectivity and chiplet design architecture. Prior to her role at Alphawave Semi, Letizia held the position of Product Line Manager at Intel, where she facilitated the integration of complex IP for external customers, as well as within Intel’s graphics and CPU products. With a background in Electrical Engineering, Letizia has contributed significantly to her field through technical papers, presentations at conferences and her involvement in defining industry standards like OpenHBI and UCIe.

Dan explores the unique and demanding requirements for next generation systems with Letizia. The need for a platform approach that addresses high-performance connectivity requirements is discussed. The role of advanced interface support through IP, chiplets and custom silicon is examined with respect to the need to scale up and scale out new systems with higher quality, reliability and shorter time to market.

Letizia describes the broad offerings Alphawave Semi is bringing to market to address these challenges. The current and future impact of this technology is explored.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.

 


CEO Interview: Dr. Andreas Kuehlmann of Cycuity
by Daniel Nenni on 02-28-2025 at 8:00 am

Dr. Andreas Kuehlmann, Executive Chairman and CEO at Cycuity, has spent his career across the fields of semiconductor design, software development, and cybersecurity. Prior to joining Cycuity, he helped build a market-leading software security business as head of engineering at Coverity and, after its acquisition by Synopsys, as General Manager of the newly formed Software Integrity business unit. In that role he led its growth from double-digit-millions to a multi-hundred-million-dollar business. He also previously worked at IBM Research and Cadence Design Systems, where he made influential contributions to hardware verification. Dr. Kuehlmann served as an adjunct professor at UC Berkeley’s Department of Electrical Engineering and Computer Sciences for 14 years, and received a Ph.D. in Electrical Engineering from the Technical University Ilmenau, Germany.

Tell us about your company?

Cycuity provides software products and services to specify and verify semiconductor device security. We help customers to ensure that security weaknesses are identified and mitigated during the design phase before manufacturing. Our security solutions are a critical element in the semiconductor product ecosystem for commercial and defense industries. They provide the broadest security assurance for semiconductor development across the design supply chain from secure usage of third-party IPs (3PIP) to full chips, including firmware. Cycuity’s products fit smoothly into existing design flows and utilize the simulation and emulation products of all three EDA vendors: Synopsys, Cadence, and Siemens EDA. Furthermore, our technology is applied to perform advanced security assessments of legacy hardware components in existing systems.

What problems are you solving?

Security threats in modern hardware systems are complex, rapidly evolving, and often overlooked during the early stages of design. The Radix platform directly addresses these challenges by identifying security weaknesses and unexpected behaviors early in the chip design lifecycle, minimizing the risk of escapes that lead to potential exploits. Traditional verification tools frequently fall short in providing complete security coverage across hardware, firmware, and software. Radix closes this gap by delivering a comprehensive security verification solution that spans the entire system from block level to software.

Radix’s systematic approach allows teams to develop security measures effectively and document their proper functioning with full transparency. Radix transforms security assurance from a fragmented and reactive process into a proactive, scalable, and fully traceable solution.

What application areas are your strongest?

We excel in delivering quantifiable assurance for semiconductors used in critical applications across industries, especially for high-stakes applications in defense, automotive, and IoT where security and reliability are paramount.

What keeps your customers up at night?

Our customers are concerned about ensuring the security and resilience of their semiconductor chips and embedded systems. What keeps them up at night is the thought of receiving a call from one of their customers reporting a security vulnerability in a chip that is broadly deployed in many products. Besides the impact on their brand, the cost of remediating a hardware security flaw can be extremely high. Moreover, customers are concerned about delivering secure semiconductors which comply with increasingly stringent industry standards. We address these challenges head-on by providing quantifiable assurance and robust security practices. Our solutions empower customers to achieve confidence in their designs, so they can focus on innovation without compromising on security.

What does the competitive landscape look like and how do you differentiate?

Cycuity is uniquely positioned as a thought leader and innovator of hardware security solutions. We have demonstrated our commitment to the development of secure and resilient microelectronics for defense and commercial applications. Our Radix platform goes beyond the typical “pass or fail” checks. It offers unmatched security design support through advanced security exploration and analysis capabilities, as well as scalable and traceable security verification – helping to more effectively and efficiently achieve security signoff and prove compliance.

What new features/technology are you working on?

We’ve got some exciting new features coming soon—check back next month for the details. For now, we can share a bit about Radix’s unique exploration capabilities which help security and verification teams to better understand chip designs from a security perspective. Unlike functional security verification, which is aimed at ensuring that a required set of security features are correctly implemented, security exploration is focused on investigating unknown or unintended side effects of security functions that are not entirely understood but could lead to security weaknesses or vulnerabilities. Security exploration with Radix offers powerful analysis and graphical visualization capabilities to reveal unexpected security behaviors that cannot be observed with traditional design tools. Even if the unexpected behavior turns out to be benign, fully analyzing and deeply understanding it serves as a powerful confirmation of the overall design intent.

How do customers normally engage with your company?

Many customers come to us needing to build a comprehensive chip security program, often starting from scratch. Security is not like flipping a switch or using a software product. It is rather a journey, one on which we help customers progress: starting with training and security requirement development, moving through tool selection, integration, and production ramp-up, and culminating in documentation and security sign-off for manufacturing.

Talk to a Security Expert

Also Read:

Cycuity at the 2024 Design Automation Conference

Hardware Security in Medical Devices has not been a Priority — But it Should Be


The Double-Edged Sword of AI Processors: Batch Sizes, Token Rates, and the Hardware Hurdles in Large Language Model Processing
by Lauro Rizzatti on 02-27-2025 at 10:00 am

Unlike traditional software programming, AI software modeling represents a transformative paradigm shift, reshaping methodologies, redefining execution processes, and driving significant advancements in AI processor requirements.

Software Programming versus AI Modeling: A Fundamental Paradigm Shift

Traditional Software Programming
Traditional software programming is built around crafting explicit instructions (code) to accomplish specific tasks. The programmer establishes the software’s behavior by defining a rigid set of rules, making this approach ideal for deterministic scenarios where predictability and reliability are paramount. As tasks become more complex, the codebase often grows in size and complexity.

When updates or changes are necessary, programmers must manually modify the code—adding, altering, or removing instructions as needed. This process provides precise control over the software but limits its ability to adapt dynamically to unforeseen circumstances without direct intervention from a programmer.

AI Software Modeling
AI software modeling represents a fundamental shift in how we approach problem solving. It enables systems to learn patterns from data through iterative training. During training, AI analyzes vast datasets to identify behaviors, then applies this knowledge in the inference phase to perform tasks like translation, financial analysis, medical diagnosis, and industrial optimization.

Using probabilistic reasoning, AI makes predictions and decisions based on probabilities, allowing it to handle uncertainty and adapt. Continuous fine-tuning with new data enhances accuracy and adaptability, making AI a powerful tool for solving complex real-world challenges.

The complexity of AI systems lies not in the amount of written code but in the architecture and scale of the models themselves. Advanced AI models, such as large language models (LLMs), may contain hundreds of billions or even trillions of parameters. These parameters are processed using multidimensional matrix mathematics, with precision or quantization levels ranging from 4-bit integers to 64-bit floating-point calculations. While the core mathematical operations, namely multiply-accumulate (MAC) operations, are rather simple, they are performed millions of times across large datasets, with all parameters processed simultaneously during each clock cycle.
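
To put a number on that, the tiny NumPy example below counts the multiply-accumulate operations in a single dense layer for one token. The layer width is an assumption, and a production LLM runs many such layers, each far larger, for every token it generates.

```python
import numpy as np

# One dense layer, y = W @ x: one MAC per weight.
d_in, d_out = 4096, 4096                              # assumed layer width
W = np.random.randn(d_out, d_in).astype(np.float16)   # reduced-precision weights
x = np.random.randn(d_in).astype(np.float16)

y = W @ x
macs = d_out * d_in
print(f"MACs for one layer, one token: {macs:,}")     # 16,777,216
```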

Software Programming versus AI Modeling: Implications on Processing Hardware

Central Processing Unit (CPU)
For decades, the dominant architecture used to execute software programs has been the CPU, originally conceptualized by John von Neumann in 1945. The CPU processes software instructions sequentially—executing one line of code after another—limiting its speed to the efficiency of this serial execution. To improve performance, modern CPUs employ multicore and multi-threading architectures. By breaking down the instruction sequence into smaller blocks, these processors distribute tasks across multiple cores and threads, enabling parallel processing. However, even with these advancements, CPUs remain limited in their computational power, lacking the enormous parallelism required to process AI models.

The most advanced CPUs achieve computational power of a few teraFLOPS and feature memory capacities reaching a few terabytes in high-end servers, with memory bandwidths peaking at around 500 gigabytes per second.

AI Accelerators
Overcoming CPU limitations requires a massively parallel computational architecture capable of executing millions of basic MAC operations on vast amounts of data in a single clock cycle.

Today, Graphics Processing Units (GPUs) have become the backbone of AI workloads, thanks to their unparalleled ability to execute massively parallel computations. Unlike CPUs, which are optimized for general-purpose tasks, GPUs prioritize throughput, delivering performance in the range of petaFLOPS—often two orders of magnitude higher than even the most powerful CPUs.

However, this exceptional performance comes with trade-offs, particularly depending on the AI workload: training versus inference. GPUs can experience efficiency bottlenecks when handling large datasets, a limitation that significantly impacts inference but is less critical for training. LLMs like GPT-4, OpenAI’s o1/o3, Llama 3-405B, and DeepSeek-V3/R1 can dramatically reduce GPU efficiency. A GPU with a theoretical peak performance of one petaFLOP may deliver only 50 teraFLOPS when running GPT-4. While this inefficiency is manageable during training, where completion matters more than real-time performance, it becomes a pressing issue for inference, where latency and power efficiency are crucial.

Another major drawback of GPUs is their substantial power consumption, which raises sustainability concerns, especially for inference in large-scale deployments. The energy demands of AI data centers have become a growing challenge, prompting the industry to seek more efficient alternatives.

To overcome these inefficiencies, the industry is rapidly developing specialized AI accelerators, such as application-specific integrated circuits (ASICs). These purpose-built chips offer significant advantages in both computational efficiency and energy consumption, making them a promising alternative for the next generation of AI processing. As AI workloads continue to evolve, the shift toward custom hardware solutions is poised to reshape the landscape of artificial intelligence infrastructure. See Table I.

Attributes | Software Programming | AI Software Modeling
Application Objectives | Deterministic and Targeted Tasks | Predictive AI and Generative AI
Flexibility/Adaptability | Rule-based and Rigid | Data-driven Learning and Evolving
SW Development | Specific Programming Languages | Data Science, ML, SW Engineering
Processing Method | Sequential Processing | Non-linear, Heavily Parallel Processing
Processor Architecture | CPUs | GPUs and Custom ASICs

Table I summarizes the main differences between traditional software programming and AI software modeling.

Source: VSORA

Key and Unique Attributes of AI Accelerators

The massively parallel architecture of AI processors possesses distinct attributes not found in traditional CPUs. Specifically, two key metrics are crucial for the accelerator’s ability to deliver the performance required to process AI workloads, such as LLMs: batch sizes and token throughput. Achieving target levels for these metrics presents engineering challenges.

Batch Sizes and the Impact on Accelerator Efficiency

Batch size refers to the number of independent inputs or queries processed concurrently by the accelerator.

Memory Bandwidth and Capacity Bottlenecks

In general, larger batches improve throughput by better utilizing parallel processing cores. As batch sizes increase, so do memory bandwidth and capacity requirements. Excessively large batches can lead to cache misses and increased memory access latency, thus hindering performance.
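
A back-of-the-envelope calculation shows why: the key/value cache that transformer inference keeps in memory grows linearly with batch size. The figures below are assumptions loosely based on a 70B-parameter-class model, not measurements.

```python
# Assumed model shape: 80 layers, 8 KV heads, head dimension 128,
# 4K-token context, FP16 cache entries (2 bytes each).
layers, kv_heads, head_dim = 80, 8, 128
seq_len, bytes_per_val = 4096, 2

def kv_cache_gib(batch_size: int) -> float:
    # 2x for keys and values, per layer, per head, per token position
    total = 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_val
    return total / 2**30

for b in (1, 8, 32, 128):
    print(f"batch {b:>3}: ~{kv_cache_gib(b):6.1f} GiB of KV cache")
```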

Latency Sensitivity

Large batch sizes affect latency because the processor must handle significantly larger datasets simultaneously, increasing computation time. Real-time applications, such as autonomous driving, demand minimal latency, often requiring a batch size of one to ensure immediate response. In safety-critical scenarios, even a slight delay can lead to catastrophic consequences. However, this presents a challenge for accelerators optimized for high throughput, as they are typically designed to process large batches efficiently rather than single-instance workloads.

Continuous Batching Challenges
Continuous batching is a technique where new inputs are dynamically added to a batch as processing progresses, rather than waiting for a full batch to be assembled before execution. This approach reduces latency and improves throughput. It may affect time to first token, but provided the scheduler can handle the execution, it achieves higher overall efficiency.

Token Throughput and Its Computational Impact

Token throughput refers to the number of tokens—whether words, sub-words, pixels, or data points—processed per second. It depends on input token sizes and output token rates, requiring high computational efficiency and optimized data movement to prevent bottlenecks.

Token Throughput Requirements
Key to defining token throughput in LLMs is the time to first token output: low latency, achieved through continuous batching to minimize delays. For traditional LLMs, the output rate must exceed human reading speed, while for agentic AI that relies on direct machine-to-machine communication, maintaining high throughput is critical.
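
A quick, assumption-laden calculation illustrates the target: at roughly 250 words per minute and about 1.3 tokens per word, a single human reader consumes only a handful of tokens per second, but the aggregate rate an accelerator must sustain scales with the number of concurrent users.

```python
# Rough figures only: ~250 words/min reading speed, ~1.3 tokens per word.
words_per_min, tokens_per_word = 250, 1.3
per_user = words_per_min * tokens_per_word / 60   # about 5.4 tokens/s per reader

for batch in (1, 16, 64):
    print(f"batch {batch:>2}: needs >= {batch * per_user:5.0f} tokens/s aggregate")
```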

Traditional Transformers vs Incremental Transformers
Most LLMs, such as OpenAI-o1, LLAMA, Falcon, and Mistral, use transformers, which require each token to attend to all previous tokens. This leads to high computational and memory costs. Incremental Transformers offer an alternative by computing tokens sequentially rather than recomputing the full sequence at every step. This approach improves efficiency in streaming inference and real-time applications. However, it requires storing intermediate state data, increasing memory demands and data movement, which impacts throughput, latency, and power consumption.

Further Considerations
Token processing also presents several challenges. Irregular token patterns, such as varying sentence and frame lengths, can disrupt optimized hardware pipelines. Additionally, in autoregressive models, token dependencies can cause stalls in the processing pipeline, reducing the effective utilization of computational resources.

Overcoming Hurdles in Hardware Accelerators
In stark contrast to the CPU, which has undergone a remarkable evolutionary journey over the past 70 years, AI accelerators are still in their formative stage, with no established architecture capable of overcoming all the hurdles in meeting the computational demands of LLMs.

The most critical bottleneck is memory bandwidth, often referred to as the memory wall. Large batches require substantial memory capacity to store input data, intermediate states and activations, while demanding high data transfer bandwidth. Achieving high token throughput depends on fast data transfer between memory and processing units. When memory bandwidth is insufficient, latency increases, and throughput declines. These bottlenecks become a major constraint on computing efficiency, limiting the actual performance to a fraction of the theoretical maximum.

Beyond memory constraints, computational bottlenecks pose another challenge. LLMs rely on highly parallelized matrix operations and attention mechanisms, both of which demand significant computational power. High token throughput further intensifies the need for fast processing performance to maintain smooth data flow.

Data access patterns in large batches introduce additional complexities. Irregular access patterns can lead to frequent cache misses and increased memory access latencies. To sustain high token throughput, efficient data prefetching and reuse strategies are essential to minimize memory overhead and maintain consistent performance.

Addressing these challenges requires innovative memory architectures, optimized dataflow strategies, and specialized hardware designs that balance memory and computational efficiency.

Overcoming the Memory Wall
Advancements in memory technologies, such as high-bandwidth memory (HBM)—particularly HBM3, which offers significantly higher bandwidth than traditional DRAM—help reduce memory access latency. Additionally, larger and more intelligent on-chip caches enhance data locality and minimize reliance on off-chip memory, mitigating one of the most critical bottlenecks in hardware accelerators.

One promising approach involves modeling the entire cache memory hierarchy with a register-like structure that stores data on a single clock cycle rather than requiring tens of clock cycles. This method optimizes memory allocation and deallocation for large batches while sustaining high token output rates, significantly improving overall efficiency.

Enhancing Computational Performance
Specialized hardware accelerators designed for LLM workloads, such as matrix multiplication units and attention engines, can dramatically boost performance. Efficient dataflow architectures that minimize unnecessary data movement and maximize hardware resource utilization further enhance computational efficiency. Mixed-precision computing, which employs lower-precision formats like FP8 where applicable, reduces both memory bandwidth requirements and computational overhead without sacrificing model accuracy. This technique enables faster and more efficient execution of large-scale models.

Optimizing Software Algorithms
Software optimization plays a crucial role in fully leveraging hardware capabilities. Highly optimized kernels tailored to LLM operations can unlock significant performance gains by exploiting hardware-specific features. Gradient checkpointing reduces memory usage by recomputing gradients on demand, while pipeline parallelism allows different model layers to be processed simultaneously, improving throughput.
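
For instance, gradient checkpointing is available off the shelf in frameworks such as PyTorch; the minimal sketch below trades recomputation for activation memory on a hypothetical two-stage model.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical two-stage model: do not keep stage1 activations; recompute
# them during the backward pass to save memory.
stage1 = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.GELU())
stage2 = torch.nn.Linear(4096, 1024)

x = torch.randn(8, 1024, requires_grad=True)
h = checkpoint(stage1, x, use_reentrant=False)  # activations discarded here
loss = stage2(h).sum()
loss.backward()   # stage1 is re-run to rebuild activations for its gradients
```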

By integrating these hardware and software optimizations, accelerators can more effectively handle the intensive computational and memory demands of large language models.

About Lauro Rizzatti

Lauro Rizzatti is a business advisor to VSORA, an innovative startup offering silicon IP solutions and silicon chips, and a noted verification consultant and industry expert on hardware emulation.

Also Read:

A Deep Dive into SoC Performance Analysis: Optimizing SoC Design Performance Via Hardware-Assisted Verification Platforms

A Deep Dive into SoC Performance Analysis: What, Why, and How

SystemReady Certified: Ensuring Effortless Out-of-the-Box Arm Processor Deployments


TRNG for Automotive achieves ISO 26262 and ISO/SAE 21434 compliance
by Don Dingee on 02-27-2025 at 6:00 am

The security of a device or system depends mainly on being unable to infer or guess an alphanumeric code needed to gain access to it or its data, be that a password or an encryption key. In automotive applications, the security requirement goes one step further – an attacker may not gain access per se, but if they can compromise vehicle safety in some way, they can cause significant problems for vehicles, property, and people. A cornerstone of security implementations is truly random numbers, and Synopsys has recently certified its True Random Number Generator (TRNG) for Automotive, achieving ISO 26262 compliance and ISO/SAE 21434 compliance.

Security and safety: increasing concerns for connected vehicles

Cars and trucks are starting to look less like embedded electronics systems and more like enterprise systems as cloud connectivity and CPU and AI processing take on more significant roles. Vehicles can now speak to the cloud, other vehicles, surrounding sensors, traffic signals, parking control, and other infrastructure.

Ensuring vehicle safety now includes preventing unauthorized remote access to its mission-critical systems via wireless communication. Security architectures rely on random numbers for:

  • Cryptographic keys: Modern cryptographic algorithms help increase unpredictability by using secure, hardened keys resistant to high-computational-power cracking schemes.
  • Authentication: Devices must authenticate on a network before participating, using secure tokens and challenge/response codes to verify their identity.
  • Nonce generation and initial values: Many algorithms require a unique, random number as a starting point or a seed value to ensure a data block’s unique processing.
  • Entropy: Randomness underpins the development of secure and resilient communication protocols that can withstand sophisticated cyberattacks.

The National Institute of Standards and Technology (NIST) drives standardization for random number generation in the NIST SP 800 family of specifications. NIST SP 800-90A covers deterministic random bit generators, while NIST SP 800-90B defines entropy sources, and NIST SP 800-90C standardizes non-deterministic random bit generators, combining the deterministic and entropic approaches for truly random numbers.
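
To see where such random values enter a protocol, the short sketch below uses Python's standard `secrets` module, which draws from the operating system's CSPRNG (itself typically seeded by a hardware entropy source). It is purely illustrative and unrelated to the Synopsys IP.

```python
import secrets

# Illustrative only: the kinds of values the list above relies on.
aes_key = secrets.token_bytes(32)     # 256-bit symmetric key material
nonce = secrets.token_bytes(12)       # unique per-message nonce (e.g., for AES-GCM)
challenge = secrets.token_hex(16)     # challenge for an authentication exchange

print(f"key: {aes_key.hex()}\nnonce: {nonce.hex()}\nchallenge: {challenge}")
```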

Functional safety via ISO 26262 and automotive cybersecurity via ISO/SAE 21434 add another layer of more formal certification requirements for automotive systems. Both standards help evaluate and categorize risks of system degradation and their severity, pointing developers to areas requiring risk mitigation or elimination. Third-party compliance testing audits automotive electronics and software design processes and verifies implementations.

Extending proven Synopsys TRNG IP solutions to automotive

Synopsys has developed and fielded TRNG IP solutions for many years. The architecture combines signal conditioning with noise sources providing ongoing entropy while not depending on process-specific circuitry, helping make the IP solution easily portable across technologies.

The latest TRNG for Automotive solution provides high-quality random numbers while integrating into automotive systems focused on safety and cybersecurity. The automotive variant of the IP derives from the NIST SP 800-90C compliant TRNG Core. It incorporates additional safety mechanisms enhancing the ability to detect, recover, and report anomalies that can lead to system failures. These mechanisms include parity bus protection for interfaces, dual rail alarms monitoring two separate data paths, and parity protection on input buffers and safety registers.

Third-party compliance evaluation at SGS-TÜV has certified the TRNG for Automotive IP for ISO 26262 with ASIL D compliance for systematic faults and ASIL B compliance for random hardware faults. Compliance with ISO/SAE 21434 cybersecurity processes is also certified by SGS-TÜV for the Automotive TRNG solution.

This no-compromise approach from Synopsys allows automakers and automotive suppliers to create communication and processing schemes with secure, safe cryptographic features based on highly reliable TRNG. More details on TRNG solutions are available online from Synopsys.

Datasheet: Synopsys True Random Number Generator for NIST SP 800-90C

White Paper: Truly Random Number Generators for Truly Secure Systems


Is Arteris Poised to Enable Next Generation System Design?
by Mike Gianfagna on 02-26-2025 at 10:00 am

The semiconductor ecosystem is changing. Monolithic design is becoming multi-die design. Processors no longer inform software development options. It’s now the other way around with complex AI software informing the design of purpose-built hardware. And all that special-purpose hardware needs drivers to make it come to life. This interplay of complex, multi-chip connectivity and ever-increasing demands of how the software invokes the hardware are all new. This isn’t your father’s (or mother’s) chip design project. All of this made me wonder where the driving forces will be to take us to the next level of semiconductor system design. There are many important players in this field. Recently, I was struck by a series of observations about one of those players. The apparent alignment is noteworthy. In this post, I’ll explore those observations. Is Arteris poised to enable next generation system design?

Some Observations

Most folks think “network on a chip,” or NoC, when they hear the name Arteris. The company has certainly blazed an important trail toward automating the interconnect of vast on-chip resources. That is the beginning of the story and not the end, however. Providing the backbone to connect the parts of a complex design opens many doors. Let’s look at a few.

One is the “memory wall” problem. While collections of CPUs and GPUs deliver huge amounts of performance, the memories that manage critical data for those systems lag behind significantly – by orders of magnitude in access time. This is the memory wall problem.

A popular approach to dealing with this issue is to pre-fetch data and store it in a local cache. Accessing a local cache is far faster – a few CPU cycles vs. over 100 CPU cycles for main memory. It’s a great approach, but it can be tricky to ensure the right data is in the right place at the right time, and kept consistent across all caches. Systems that effectively deliver this solution are called cache coherent, and achieving this goal is quite difficult.

Arteris has developed a cache-coherent NOC to address this challenge. That’s one obstacle out of the way. You can learn more about this Arteris solution here.

Another challenge is just keeping track of the huge list of IPs used in contemporary designs. Tasks here include ensuring the right IP is deployed, keeping all stakeholders up to date, and tracking the various configurations and derivatives. Current designs can contain anywhere from 500 to 1,000 IP blocks and from 200K to more than 5M registers. Challenges this creates include:

  • Content integration from various sources (soft IP, configurable third-party IP, internal legacy IP)
  • Design manipulation (derivative, repartitioning, restructuring)

There are many formats used to keep track of all this. Spreadsheets, DOCX, IP-XACT, and SystemRDL are a few examples. Again, Arteris has a well thought out solution to this problem: its Magillem technology can tame these challenges. You can learn more about this Arteris solution here.

And I’ll examine one more: how to keep track of all the data required to orchestrate the interface between all that complex software and the hardware that brings those software innovations to life. This is typically called the hardware/software interface, or HSI.

This problem has many dimensions. Not only is it complex, but the formats needed by all stakeholders are different. That is, folks like RTL architects, software developers, verification engineers and technical publication staff all need their own version of this information in a specific format. Generating all that information in lockstep and conveying the same design intent in different formats is not easy.
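
As a tiny illustration of the single-source idea (not CSRCompiler itself), the sketch below takes one hypothetical register description and emits two of the views different stakeholders need: a C header fragment for firmware and a plain-text field table for documentation.

```python
# Hypothetical single-source register description.
CTRL_REG = {
    "name": "CTRL",
    "offset": 0x0000,
    "fields": [("ENABLE", 0, 1), ("IRQ_MASK", 1, 1), ("MODE", 2, 3)],  # (name, lsb, width)
}

# View for firmware engineers: a C header fragment.
print(f"#define {CTRL_REG['name']}_OFFSET 0x{CTRL_REG['offset']:04X}")
for fname, lsb, width in CTRL_REG["fields"]:
    mask = ((1 << width) - 1) << lsb
    print(f"#define {CTRL_REG['name']}_{fname}_MASK 0x{mask:08X}")

# View for technical publications: a plain-text field table.
for fname, lsb, width in CTRL_REG["fields"]:
    print(f"{fname:<10} bits [{lsb + width - 1}:{lsb}]")
```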

In a past life, I worked with a company called Semifore that had a very well thought out way of dealing with these challenges. What happened to Semifore? If you guessed they are now part of Arteris, you would be right. More technology to knock down more of the obstacles to achieve next generation designs. The core tool from Semifore is called CSRCompiler, and the diagram below will give you a sense of what it can do.

CSR Compiler Unifies Design Teams

There is a lot more to be said here, but you get the idea.

What’s Next?

I’ve just scratched the surface, highlighting some high-profile challenges that must be tamed to get to the next level of semiconductor system design. It turns out Arteris has mainstream technology to address all of them. They are a NOC company, and a lot more.

There are other challenges to be met of course. IP-XACT is an important element of advanced system design. This standard, also known as IEEE 1685, is an XML format that defines and describes reusable IP to facilitate its use in chip designs. IP-XACT was created by the SPIRIT Consortium as a standard to enable automated configuration and integration through tools and evolved into an IEEE standard.

There is a new version of this standard called IEEE 1685-2022. This new version contains a lot of additional functionality. It will be important for any company who aims to enable next generation system design to support these new capabilities. A partial list of what’s new includes:

  • Removed conditionality
  • Added XML document to describe memory element definitions
  • Added mode-dependent memory and register access
  • Added mapping from ports to register fields
  • Added register field aliasing and broadcasting
  • Added power domains and power domain bindings

Even though there are new challenges on the horizon, I have an optimistic view of how Arteris can help. In an Electronic Design article, Insaf Meliane, Product Management & Marketing Manager at Arteris stated:

The ever-evolving landscape of semiconductor chip design necessitates effective communication between design teams. The HSI serves as the bridge, and while challenges persist due to differing languages and requirements, tools like the CSRCompiler help simplify the process.

The methodology automatically documents changes across entire functional teams to deliver a reliable, up-to-date specification. It provides a single-source specification for register and memory-map information, fully configured for all teams in the formats and views they require.

This gives me more confidence in the Arteris approach to these problems. Is Arteris poised to enable next generation system design? I think the answer is YES, and I can’t wait to see what’s next.

Also Read:

Arteris Raises Bar Again with AI-Based NoC Design

MCUs Are Now Embracing Mainstream NoCs

Arteris Empowering Advances in Inference Accelerators


Bug Hunting in Multi Core Processors. Innovation in Verification
by Bernard Murphy on 02-26-2025 at 6:00 am

What’s new in debugging multi-/many-core systems? Paul Cunningham (GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and lecturer at Stanford, EE292A) and I continue our series on research ideas. As always, feedback welcome.

The Innovation

This month’s pick is Post-Silicon Validation of the IBM POWER9 Processor. This was published at the 2020 DATE conference. The authors are from IBM and the paper has 1 citation.

This topic continues to attract interest given the accelerating growth of these platforms among hyperscalers, though for some reason the topic has created barely a ripple among our usual research paper resources. An exception is the IBM Threadmill paper we covered in 2022 and a number of following papers from the same group. Here we review the latest of these papers, describing IBM’s application of the approach to their POWER9 processor.

The same basic approach continues from the earlier paper, testing on post-silicon using a bare-metal exerciser with automated randomization between cycles. Several important refinements have been added. One I find interesting is irritators, used to bias towards multi-thread (and possibly multi-core) conflicts.

 

Paul’s view

We’re zooming back in again this month on randomized instruction generation for processor verification. A few years ago we blogged on a tool called Threadmill used by IBM for verification of their POWER7 processor. This month we’re checking out a short paper on their experiences verifying the POWER9 processor.

More and more companies are developing custom processors based either on Arm-64 or RISC-V ISAs. Arm-based computing is scaling out in datacenters and laptops, and RISC-V processors are becoming widespread in a variety of embedded applications. Verifying processors, especially advanced ones with multiple cores and multiple out-of-order execution pipelines is really hard and somewhat of a dark art.

Threadmill is a low-level “exerciser” software program that runs on the bare metal processor directly (i.e. without any OS). It is configured with templates – snippets of machine code parametrized so they can be randomized in some way – randomize instructions, randomize addresses, etc. The exerciser can be run pre-silicon in simulation, emulation, or on FPGA, and also can be run post-silicon in the lab.

This paper shares several interesting new nuggets on how IBM enhanced Threadmill since POWER7: Firstly, they weighted the runtime allocation of templates during emulation, running templates that find more bugs for 10-100x more clock cycles. Second, they deployed some clever information encoding tricks to assist in debug. For example, for a bug related to dropping a write when multiple cores increment the same memory address, they have each core increment that memory address by a different amount. Then the difference between actual and expected value in that address tells them which core’s write was dropped due to the race. Third, they enhance Threadmill with more tricks to bias randomization to better hit bugs. The original Threadmill paper from POWER7 shares the trick of using the same random seed across multiple cores for memory addresses. This increases the frequency of load/store races. In this POWER9 paper they biased addresses to also align with memory page boundaries, to increase the frequency of cross-page accesses. Lastly, they used AI to help further prioritize templates to hit coverage faster.
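
The dropped-write trick is simple enough to show in a few lines. The toy reconstruction below is not IBM's test code, just the arithmetic: give each core a distinct increment, and a single lost write is identified by the deficit between expected and observed values.

```python
# Toy reconstruction of the debug trick: core_id -> unique increment.
increments = {0: 1, 1: 10, 2: 100, 3: 1000}

expected = sum(increments.values())        # 1111 if every write lands
observed = expected - increments[2]        # pretend core 2's write was dropped

deficit = expected - observed
dropped = next(core for core, inc in increments.items() if inc == deficit)
print(f"deficit {deficit} -> core {dropped} lost its write in the race")
```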

All in, compared to POWER8, they found 30% more bugs in 80% of the time. Decent progress on a very tough problem!

Raúl’s view

State-of-the-art processors, such as the IBM POWER9 processor described in this paper, typically undergo multiple tape-outs. Pre-silicon verification cannot identify all bugs, particularly those related to hard-to-hit software timing issues, very long loops, or deep power states, which are not exposed by the Instruction Set Simulator (ISS). This challenge is exacerbated in multi-core, multi-threaded architectures.

The reviewed paper outlines the validation methodology IBM implemented for the POWER9 processor. In October 2022, we examined the methodology used for validating the POWER7 processor, many aspects of which remain applicable.

The approach involves using a bare-metal, self-contained exerciser called Threadmill, which generates sequences of instructions based on templates. These sequences are executed pre-silicon on the ISS and within a highly instrumented Exerciser on Accelerator (EoA) environment. Root-cause analysis is considered complete only when a bug is reproduced across simulation, EoA, and the post-silicon lab. The paper details numerous practical aspects of this process. For example, when the bug rate declines, hardware irritators are employed to induce new bugs, such as by artificially reducing cache sizes and queue depths. Templates with high RTL coverage that uncover numerous bugs are executed 10 to 100 times longer than usual on the accelerator.
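As a rough illustration of that 10-100x weighting (an assumed allocation policy for illustration only; the paper does not give a formula), the sketch below splits a fixed accelerator-cycle budget in proportion to each template’s bug yield:

# Rough sketch of bug-yield-weighted cycle allocation (an assumed policy for
# illustration; the paper does not give the actual formula). Templates that
# have exposed more bugs get a proportionally larger share of accelerator time.
bugs_found = {"cache_race": 12, "cross_page_ldst": 7, "alu_basic": 0}

def cycle_shares(bug_counts, base=1, boost=10, budget=1_000_000):
    """Split a fixed cycle budget: baseline weight per template plus a boost per bug."""
    weights = {t: base + boost * n for t, n in bug_counts.items()}
    total = sum(weights.values())
    return {t: budget * w // total for t, w in weights.items()}

print(cycle_shares(bugs_found))   # productive templates receive far more cycles than unproductive ones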

IBM’s overall validation methodology has been improving, with results for POWER9 validation compared to POWER8 showing an increase in bugs found in EoA from 1% to 6%, in post-silicon from 1% to 4%, and a reduction in the days required to root cause 90% of the bugs from 31 to 17.

There are open-source instruction generators for RISC-V available on GitHub. The RISC-V DV (Design Verification) framework, maintained by CHIPS Alliance, is an open-source tool for verifying RISC-V processor cores. FORCE-RISC-V, an instruction sequence generator for the RISC-V instruction set architecture from Futurewei, supports multi-core and multi-threaded instruction generation.

Overall, the paper provides valuable insights, especially for practitioners involved in processor validation.

Also Read:

Embracing the Chiplet Journey: The Shift to Chiplet-Based Architectures

2024 Retrospective. Innovation in Verification

Accelerating Automotive SoC Design with Chiplets

Accelerating Simulation. Innovation in Verification


2025 Outlook with Dr. Rui Tang of MSquare Technology

2025 Outlook with Dr. Rui Tang of MSquare Technology
by Daniel Nenni on 02-25-2025 at 10:00 am


Tell us a little bit about yourself and your company. 

I am Rui Tang, co-founder and VP of MSquare Technology. With a Ph.D. in Computer Engineering from Northeastern University and a master’s degree in management science and engineering from Stanford University, I bring over 23 years of experience in the IC industry. Prior to MSquare, I served as Chief Strategy Officer and General Manager at FuriosaAI, Investment Director at BOE Venture, Staff Engineer at Apple, and Principal Engineer at Oracle.

Founded in 2021, MSquare Technology is a leading provider of integrated circuit IPs and Chiplet solutions, committed to addressing challenges in chip interconnectivity and vertical integration in the smart economy era. With offices in Shanghai, Taipei, Sydney, and San Jose, and a team of over 150 employees—80% dedicated to research and development—we aim to foster an open ecosystem platform for AI and Chiplets. Our mission is to empower innovation and growth across the IP and Chiplet industry, leveraging our extensive, cutting-edge, and cost-effective IP library to meet the diverse needs of our global clients.

What was the most exciting high point of 2024 for your company?

In June 2024, MSquare Technology reached a significant milestone with the readiness of our Chiplet product, the ML100 IO Die, which is designed to bridge the connection between UCIe and HBM (High Bandwidth Memory). This year, we achieved successful commercial adoption of the ML100 IO Die by two leading clients, marking a major step in the product’s market validation and implementation.

As a high-bandwidth memory solution, the ML100 IO Die integrates efficient Die-to-Die interconnect IP and supports the UCIe 1.1 protocol. It delivers a maximum bandwidth of 819.2 GB/s and supports data transfer rates of 6400 Mbps. This product meets the stringent requirements of AI applications for high bandwidth and low power consumption, enabling ultra-low latency interconnects between chips and high-speed data transfer. These capabilities greatly enhance the efficiency of AI model training and inference, contributing to the continued advancement of AI technologies.
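For context, the quoted figures are mutually consistent if the 6400 Mbps is read as the per-pin HBM3 data rate over a standard 1024-bit interface (an assumption on our part; the interview does not state the bus width). A quick back-of-the-envelope check:

# Back-of-the-envelope check of the quoted ML100 figures (our arithmetic,
# assuming a standard 1024-bit HBM3 data bus, which the interview does not state).
bus_width_bits = 1024            # assumed HBM3 interface width
pin_rate_mbps = 6400             # quoted data transfer rate, read as per pin
bandwidth_gb_per_s = bus_width_bits * pin_rate_mbps / 1000 / 8
print(bandwidth_gb_per_s)        # 819.2 -> matches the quoted 819.2 GB/s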

In addition to the success of ML100 IO Die, we’ve expanded our product portfolio with multiple new offerings in 2024, further strengthening our position in the semiconductor industry. Our IP products have been validated by the top 5 foundries, covering over 400 process nodes from 5nm to 180nm.

What was the biggest challenge your company faced in 2024?

In 2024, the semiconductor industry faced a confluence of macroeconomic fluctuations, geopolitical tensions, technological evolution, and tightened capital flows. Many chip startups struggled to survive and were eventually eliminated in the fierce competition. However, challenges often come hand-in-hand with opportunities.

From a technical perspective, MSquare encountered unique challenges in advancing Chiplet architecture, which is regarded as the future trend of the semiconductor industry. While Chiplet technology offers the potential for higher computational performance and resource integration, it also brings several critical hurdles:

  1. Complexity in Heterogeneous Architecture Disaggregation: Breaking down GPU or ASIC designs into Chiplets poses significant technical bottlenecks, as avoiding increases in area and cost while optimizing data transfer performance remains challenging.
  2. Interconnect Standards and Compatibility Issues: Although the release of the UCIe standard provides a foundation for the Chiplet ecosystem, actual implementation is hindered by compatibility issues between different vendors’ IPs. This keeps the ecosystem relatively closed and complicates product development.

These industry-wide technical challenges demand robust innovation capabilities and high execution efficiency from companies to navigate successfully.

How is your company’s work addressing this biggest challenge? 

To address these challenges, MSquare has successfully turned obstacles into opportunities for growth through three key strategies (technological breakthroughs, ecosystem collaboration, and resource focus):

  1. Technological Innovation and Product Breakthroughs
    • In Chiplet architecture design, MSquare introduced its self-developed ML100 IO Die, which optimizes interconnect efficiency and power consumption during data processing. This innovation effectively overcomes bandwidth bottlenecks in multi-Chiplet collaboration.
    • To address memory bandwidth limitations and constraints in main chip design, we developed the groundbreaking M2LINK solution. This solution converts the HBM protocol to the UCIe protocol and integrates it into a standard module using RDL interposer packaging, while enabling seamless integration with the main SoC. This breakthrough delivers several technical and commercial advantages:
      • Reduces main chip and packaging costs
      • Enhances memory capacity and bandwidth performance
      • Increases design flexibility and shortens product development cycles
      • Ensures stability for high-performance computing and AI applications
  2. Ecosystem Collaboration and Standards Advocacy
    • MSquare actively participates in industry collaborations to promote the adoption of UCIe standards, improving IP compatibility and openness within the Chiplet ecosystem. Through deep cooperation with customers and industry partners, we have optimized Chiplet design and production processes, lowering the technical barriers for customers adopting MSquare’s technologies.
  3. Strategic Focus and Resource Optimization
    • MSquare concentrates on fast-growing market segments such as AI, high-performance computing (HPC), and data centers, channeling resources into breakthrough innovations in these areas.
    • By avoiding the inefficiencies of blind expansion, our resource allocation strategy has significantly enhanced R&D efficiency and customer delivery capabilities, further strengthening our market competitiveness.

What do you think the biggest growth area for 2025 will be, and why?

Looking ahead to 2025, the biggest growth area is expected to be the rapid expansion of AI and data centers, as AI-powered systems and devices continue to transform industries. This growth is driven by increasing demand for chips that address three core challenges in AI development: memory capacity, interconnect bandwidth, and computing performance.

As AI workloads grow more complex, there will be a surge in demand for innovative semiconductor solutions capable of meeting these evolving requirements. Additionally, the growth of AI is triggering significant changes in data center architectures, creating new opportunities for IP and Chiplet suppliers to innovate. One notable trend is the shift from traditional LPO/NPO approaches to Co-Packaged Optics (CPO) technology, which enables higher bandwidth and power efficiency. Furthermore, Optical Chiplets are gaining traction, particularly in their use within switches and in facilitating communication between XPUs and memory.

This transformative wave is reshaping the semiconductor landscape and presenting unprecedented opportunities for companies at the forefront of AI and Chiplet innovation.

How is your company’s work addressing this growth? 

MSquare is addressing the rapidly growing demands in the AI and data center industries with innovative Chiplet solutions, such as our ML100 IO Die. This product offers two flexible configurations: UCIe + HBM3 and UCIe + LPDDR. The UCIe + HBM3 solution decouples HBM from the SoC, reducing the influence of SoC temperature on HBM performance and ensuring compatibility with HBM3 chips for flexible placement. The UCIe + LPDDR solution decouples the Memory PHY from the SoC, increasing memory capacity and offering integration options that accelerate product upgrades.

The ML100 IO Die is built on the UCIe 1.1 Specification, enabling ultra-high bandwidth of up to 1 TB/s with ultra-low latency. This architecture ensures seamless communication between dies, while the integrated HBM3 IP supports transfer rates of up to 6400 Mbps. These technologies provide high flexibility and scalability, empowering customers to overcome performance bottlenecks in AI workloads and data center applications.

To further support the adoption of Chiplet designs, MSquare is continuously developing high-speed interface IPs, including UCIe, HBM, and LPDDR, while collaborating with ecosystem partners in packaging and manufacturing. This approach not only ensures seamless integration but also accelerates solution implementation for customers. With these advancements, MSquare is well-positioned to lead the industry and drive innovation in next-generation AI and data center technologies.

What conferences did you attend in 2024 and how was the traffic?

In 2024, we participated in several prominent conferences across the globe, including:

  • IIC 2024, Shanghai
  • AI Hardware & Edge AI Summit 2024, San Jose
  • SemiBAY, Shenzhen
  • EE Tech Summit, Taipei
  • ICCAD-Expo 2024, Shanghai

Traffic at these events varied. ICCAD-Expo 2024, for example, exceeded expectations with over 6,700 attendees, highlighting the strong interest in advanced chip design. Similarly, AI-focused events like the AI Hardware & Edge AI Summit saw significant year-on-year growth in attendance, reflecting the increasing momentum in artificial intelligence and edge computing sectors.

On the other hand, events such as SemiBAY and EE Tech Summit had moderate but steady traffic, providing valuable opportunities for targeted networking and discussions in more traditional semiconductor and electronics domains.

Overall, the increased attendance at AI and chip design conferences aligns with the industry’s shift toward intelligent computing and advanced architectures, reinforcing our focus on these high-growth areas.

Will you attend conferences in 2025? Same or more?

Yes, we plan to attend conferences in 2025, and the number will likely increase compared to 2024. Based on the positive outcomes and valuable connections we gained from the 2024 events, we aim to expand our presence at both existing and new conferences.

In particular, we will focus on attending events in high-growth areas such as artificial intelligence and Chiplet, aligning with industry trends and our strategic priorities. AI-related conferences, which have seen a rise in attendance and engagement, will remain a key focus for us.

Additionally, we are exploring opportunities to participate in more global forums to strengthen our international visibility. By attending a broader range of conferences, we aim to stay at the forefront of emerging technologies and continue fostering partnerships that drive innovation.

How do customers engage with your company?

Customers engage with us through multiple channels, ensuring a seamless and collaborative experience throughout their journey:

  1. Direct Collaboration:
    We work closely with customers through direct consultations to understand their needs, provide tailored IP solutions, and assist in their chiplet design and integration efforts. Our technical team actively supports their development process to ensure smooth implementation.
  2. Conferences and Industry Events:
    Conferences such as IIC, ICCAD-Expo, and the AI Hardware & Edge AI Summit serve as key touchpoints where customers can engage with us in person. These events allow us to showcase our latest products, demonstrate use cases, and discuss collaborative opportunities face-to-face.
  3. Digital Channels:
    Many customers reach out to us through our website, online webinars, and social media platforms. These channels provide accessible ways for customers to learn about our offerings, request product demos, and initiate discussions.
  4. Long-Term Partnerships:
    For our strategic partners, engagement is often a continuous, multi-phase process. We work as an extension of their teams, offering co-development opportunities and customized solutions aligned with their roadmaps.

By offering diverse engagement methods, we ensure that customers have the flexibility to interact with us in ways that best suit their needs, ultimately building strong, long-term relationships.

Additional questions or final comments? 

As we look ahead to 2025, we’re excited about the continued opportunities in Chiplet, AI, and high-performance computing. Our team is focused on pushing the boundaries of innovation while maintaining strong customer relationships. If there are any specific questions regarding our products or upcoming developments, we’d be happy to discuss them further.

We appreciate your interest in our company, and we look forward to exploring new ways to collaborate and contribute to the advancement of the semiconductor industry in the years to come. Feel free to reach out anytime!

Also Read:

How Synopsys Enables Gen AI on the Edge

Harnessing Modular Vector Processing for Scalable, Power-Efficient AI Acceleration

2025 Outlook with Paul Wells of sureCore