Anomaly Detection Through ML. Innovation in Verification
by Bernard Murphy on 08-31-2023 at 6:00 am

Assertion based verification only catches problems for which you have written assertions. Is there a complementary approach to find problems you haven’t considered – the unknown unknowns? Paul Cunningham (Senior VP/GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.

The Innovation

This month’s pick is Machine Learning-based Anomaly Detection for Post-silicon Bug Diagnosis. The paper was published at the 2013 DATE conference. The authors are/were from the University of Michigan.

Anomaly detection methods are popular where you can’t pre-characterize what you are looking for, for example in credit card fraud or in real-time security where hacks continue to evolve. The method gathers behaviors over a trial period, manually screened to fall within expected behavior, then flags outliers in ongoing testing as potential problems for closer review.

Anomaly detection techniques either use statistical analyses or machine learning. This paper uses machine learning to build a model of expected behavior. You could also easily imagine this analysis being shifted left into pre-silicon verification.

Paul’s view

This month we’ve pulled a paper from 10 years ago on using machine learning to try to automatically root-cause bugs in post-silicon validation. It’s a fun read and looks like a great fit for revisiting now using DNNs or LLMs.

The authors equate root-causing post-silicon bugs to credit card fraud detection: every signal traced in every clock cycle can be thought of as a credit card transaction, and the problem of root causing a bug becomes analogous to identifying a fraudulent credit card transaction.

The authors’ approach goes as follows: divide up simulations into time slices and track the percent of time each post-silicon traced debug signal is high in each time slice. Then partition the signals based on the module hierarchy, aiming for a module size of around 500 signals. For each module in each time slice train a model of the “expected” distribution of signal %high times using a golden set of bug free post-silicon traces. This model is a very simple k-means clustering of the signals using difference in %high times as the “distance” between two signals.
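
As a rough sketch of what that training step could look like (my own Python illustration, not the authors’ code; the per-module signal array, the value of k, and the use of scikit-learn are all assumptions):

```python
# Illustrative sketch of the "golden model" training step, not the authors' code.
# pct_high holds, for one module and one time slice of a bug-free trace, the
# fraction of cycles each traced signal was high. The value of k is assumed.
import numpy as np
from sklearn.cluster import KMeans

def train_golden_model(pct_high: np.ndarray, k: int = 8):
    X = pct_high.reshape(-1, 1)              # "distance" = difference in %high
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Record a bounding interval per cluster for later outlier checks.
    bounds = [(X[km.labels_ == c].min(), X[km.labels_ == c].max())
              for c in range(k)]
    return km, bounds
```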

For each failing post-silicon test, the %high signal distribution for each module in each time slice is compared to the golden model and the number of signals whose %high time is outside the bounding box of its golden model cluster are counted. If this number is over a noise threshold, then those signals in that time slice are flagged as the root cause of the failure.
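
Continuing the sketch above, the detection step might look something like this, with the noise threshold chosen arbitrarily:

```python
# Illustrative detection step against the golden model above; the noise
# threshold is an arbitrary assumption, not a value from the paper.
def flag_anomalies(pct_high, km, bounds, noise_threshold=5):
    X = pct_high.reshape(-1, 1)
    labels = km.predict(X)
    outliers = [i for i, (x, c) in enumerate(zip(X[:, 0], labels))
                if not (bounds[c][0] <= x <= bounds[c][1])]
    # Only report a root cause if more signals misbehave than expected noise.
    return outliers if len(outliers) > noise_threshold else []
```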

It’s a cool idea, but on the ten OpenSPARC testcases benchmarked, 30% of the tests do not report the correct time slice or signals, which is far too high to be of any practical use. I would love to see what would happen if a modern LLM or DNN were used instead of simple k-means clustering.

Raúl’s view

This is an “early” paper from 2013 using machine learning for post-silicon bug detection. For its time this must have been advanced work; it is listed with 62 citations in Google Scholar.

The idea is straightforward: run a test many times on a post-silicon design and record the results. When intermittent bugs occur, different executions of the same test yield different results, some passing and some failing. Intermittent failures, often due to on-chip asynchronous events and electrical effects, are among the most difficult to diagnose. The authors briefly consider supervised learning, in particular one-class learning (there is only positive training data available, since bugs are rare), but discard it as “not a good match for the application of bug finding”. Instead, they apply k-means clustering: similar results are grouped into k clusters consisting of “close” results, minimizing the sum-of-squares distance within clusters. The paper reveals numerous technical details necessary to reproduce the results:

  • Results are recorded as the “fraction of time the signal’s value was one during the time step”.
  • The number of signals in a design, on the order of 10,000, is the dimensionality of the k-means clustering, which is NP-hard with respect to the number of dimensions, so the number of signals is capped at 500 using principal component analysis.
  • The number of clusters can’t be too small (underfitting) nor too large (overfitting).
  • A proper anomaly detection threshold needs to be picked, expressed as a percentage of the total failing examples under consideration.
  • Time localization of a bug is achieved by two-step anomaly detection: first identifying which time step presents enough anomalies to reveal the occurrence of a bug, and then in a second round identifying the responsible bug signals.

Experiments on an OpenSPARC T2 design of about 500M transistors ran 10 workloads, with test lengths ranging between 60,000 and 1.2 million cycles, 100 times each as training. The authors then injected 10 errors and ran 1000 buggy tests. On average 347 signals were detected for a bug (ranging from none to 1000) and it took ~350 cycles of latency from bug injection to bug detection. The number of clusters and the detection threshold strongly influence the results, as does the amount of training data. False positives and false negatives added up to 30-40 (in 1000 buggy tests).

Even though the authors observe that “Overall, among the 41,743 signals in the OpenSPARC T2 top-level, the anomaly detection algorithm identified 347, averaged over the bugs. This represents 0.8% of the total signals. Thus, our approach is able to reduce the pool of signals by 99.2%”, in practice this may not be of great help to an experienced designer. Ten years have passed; it would be interesting to repeat this work using today’s machine learning capabilities, for example LLMs for anomaly detection.


RISC-V 64 bit IP for High Performance
by Daniel Payne on 08-30-2023 at 10:00 am


RISC-V as an Instruction Set Architecture (ISA) has grown quickly in commercial importance and relevance since its release to the open community in 2015, attracting many IP vendors that now provide a variety of RTL cores. Roger Espasa, CEO and Founder of Semidynamics, has presented at RISC-V events on how their IP is customized for compute challenges that require high bandwidth and high performance cores with vector units. Semidynamics was founded in 2016, is headquartered in Barcelona, and already has customers in the US and Asia, offering two customizable RISC-V IP cores:

  • Avispado – in-order RISCV64GCV, supporting AXI and CHI
  • Atrevido – out-of-order RISCV64GC, supporting AXI and CHI

A typical CPU has a handful of big cores and large caches, making it easy to program, though not high performance on parallel code.

GPUs, by contrast, have many tiny cores that provide high performance for parallel code, but are harder to program and add communication latency through the PCIe bus when data needs to be passed back and forth between the CPU and the GPU.

CPU, GPU comparison

The approach at Semidynamics is to use a RISC-V core connected to vector compute cores, which makes the system easy to program, delivers higher performance on parallel code, and offers zero communication latency. A CPU plus vector unit provides the best of both worlds.

CPU plus Vector unit

The RISC-V vector specification defines 32 vector registers, and inside a vector unit you can add a number of vector cores along with a connection to your cache.

Vector Unit

With Semidynamics IP you can customize the number of vector cores: 4, 8, 16 or 32. Another way to look at this is datapath width: 4 vector cores correspond to 256 bits, scaling up to 2,048 bits with 32 vector cores.

IP users also choose which data types to support: FP64, FP32, FP16, BF16, INT64, INT32, INT16, INT8. For an AI application they may choose FP16 and BF16, while an HPC application could select FP64 and FP32.

The third customization is the Vector Register Length, where for more performance and lower power you can make the vector register bigger than the vector unit.

Here’s the block diagram of the Atrevido 423-V8:

Atrevido 423 + V8 Vector Unit

The vector unit is fully out of order, which is unique among RISC-V IP vendors. The combination of the vector unit and the Gazzillion unit is capable of streaming data at over 60 bytes/cycle.

High Bandwidth: Vector + Gazzillion

The purple line shows the read performance: in the L1 cache it is 20-60 bytes/cycle. Other machines show a rapid drop in bandwidth after leaving the L1 cache, while this approach keeps going, flattening at 56 bytes/cycle. Even going to DDR memory shows a bandwidth of 40 bytes/cycle; with a clock rate of 1.0GHz that makes 40 GB/s of bandwidth.

IP customers can even add their own RTL code connected to the Vector Unit for their own purposes.

Performance of matrix multiplication is important in AI workloads, and on the OOO V8 vector unit there is a peak of 16 FP64 FLOPS/cycle, with 99% of peak reached for matrix sizes >= 400. For a small 24×24 matrix the performance is 7 FP64 FLOPS/cycle, or about 50% of peak. Matrix multiplication in FP16 using a vector unit with 8 vector cores has a peak of 64 FP16 FLOPS/cycle, reaching 99% of peak for M >= 600.
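
As a back-of-envelope sanity check of those peak figures, assuming each 64-bit vector core retires one fused multiply-add per element per cycle (my assumption, not a Semidynamics specification), the quoted FP64 and FP16 peaks fall out directly:

```python
# Back-of-envelope peak FLOPS/cycle for a vector unit with N 64-bit vector
# cores, assuming one fused multiply-add (2 FLOPs) per element per cycle.
# This is my own sanity check, not a Semidynamics datasheet formula.
def peak_flops_per_cycle(vector_cores: int, element_bits: int) -> int:
    elements_per_core = 64 // element_bits
    return vector_cores * elements_per_core * 2   # FMA counts as 2 FLOPs

print(peak_flops_per_cycle(8, 64))   # 16 FP64 FLOPS/cycle, matching the article
print(peak_flops_per_cycle(8, 16))   # 64 FP16 FLOPS/cycle, matching the article
```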

A real-time object detection benchmark called YOLO (You Only Look Once) was run on the Atrevido 423-V8 platform, and it showed 58% higher performance per vector core than competitors. These results were for a network with 24 layers, 5.56 Gops/frame, and about 9M parameters.

YOLO Comparison

Summary

Choosing a RISC-V IP vendor is a complicated task, so knowing about vendors like Semidynamics can help you better understand how a customized approach could most efficiently run your specific workloads. With Semidynamics you get to make architectural choices like in-order or out-of-order, with or without vector units. The reported numbers from this IP vendor look promising, and I look forward to their future announcements.

Also Read:

Deeper RISC-V pipeline plows through vector-scalar loops

RISC-V Summit Buzz – Semidynamics Founder and CEO Roger Espasa Introduces Extreme Customization

Configurable RISC-V core sidesteps cache misses with 128 fetches


Modeling EUV Stochastic Defects with Secondary Electron Blur
by Fred Chen on 08-30-2023 at 8:00 am


Extreme ultraviolet (EUV) lithography is often described as benefiting from its 13.5 nm wavelength (actually a range of wavelengths, mostly ~13.2-13.8 nm), when in fact it works through the action of secondary electrons: electrons released by photoelectrons, which are themselves released by ionization from absorbed EUV (~90-94 eV) photons. The photons are not only absorbed in the photoresist film but also in the layers underneath. The released electrons migrate varying distances from the point of absorption, losing energy in the process.

These migration distances can go over 10 nm [1-2]. Consequently, images formed by EUV lithography are subject to an effect known as blur. Blur can be most basically understood as the reduction of the difference between the minimum and maximum chemical response of the photoresist. Blur is often modeled through a Gaussian function convolved with the original optical image [3-4].
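
As a toy illustration of that Gaussian blur model (my own example; the pitch, pixel size, and sigma are arbitrary), blurring a sinusoidal aerial image visibly shrinks the difference between its minimum and maximum:

```python
# Toy illustration of Gaussian blur applied to a sinusoidal aerial image;
# the pitch, pixel size, and sigma are arbitrary placeholders.
import numpy as np
from scipy.ndimage import gaussian_filter1d

pixel_nm = 0.5
x = np.arange(256) * pixel_nm
aerial = 0.5 + 0.5 * np.cos(2 * np.pi * x / 32.0)     # 32 nm pitch image

sigma_nm = 5.0                                        # blur scale length
blurred = gaussian_filter1d(aerial, sigma=sigma_nm / pixel_nm, mode="wrap")
# Blur reduces the difference between minimum and maximum response:
print(aerial.max() - aerial.min(), blurred.max() - blurred.min())
```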

In such modeling, however, it is often not mentioned that the blur scale length, usually referred to as sigma, is not a fundamentally fixed number but belongs to a distribution [5]. This is consistent with the observation that a higher EUV dose leads to a larger observed blur [2,5]: more electrons released allow a larger range of distances traveled [2,6]. Note that pure chemical blur from diffusion does not have the same dose dependence [3,7].

It was recently demonstrated that secondary electron blur increasing with dose can lead to the observed stochastic defects in EUV lithography [8]. The higher dose leads to a wider allowed range of blur.

Local base blur range at different doses, taken at different probabilities from the base blur probability distribution.

The simulation model combines three stages of random number generation: (1) photon absorption, (2) secondary electron yield, and (3) electron dose-dependent blur range. Unexposed stochastic defects are dominant at low doses where there are too few photons absorbed. Exposed stochastic defects are dominant at higher doses where the rare (e.g., probability ~ 1e-8) ultrahigh (>10 nm) blur promotes too much secondary electron exposure near the threshold value for printing.
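
A toy Monte Carlo along the lines of those three stages might look like the sketch below; the distributions, secondary electron yield, and dose-to-blur coupling are placeholders of my own, not the calibrated model from [8]:

```python
# Toy Monte Carlo of the three random stages; the distributions, secondary
# electron yield, and dose-to-blur coupling are placeholders, not the
# calibrated model from [8].
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)
pixels, dose = 256, 20.0                   # dose = mean absorbed photons/pixel

photons = rng.poisson(dose, pixels)        # (1) photon absorption shot noise
electrons = rng.poisson(3.0 * photons)     # (2) secondary electron yield
sigma_px = 2.0 + rng.exponential(0.1 * dose)   # (3) blur range grows with dose
exposure = gaussian_filter1d(electrons.astype(float), sigma_px)
printed = exposure > exposure.mean()       # simple threshold model for printing
```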

Higher blur makes it easier for smaller stochastic dose variations to cross the printing threshold, enabling exposed or unexposed defects.

One consequence of defects arising both from too few absorbed photons at low dose and from dose-increased blur at higher dose is the emergence of a floor, or valley, for stochastic defects, preventing them from being absent entirely.

At lower dose or exposed CD there tend to be unexposed defects, while at higher dose or exposed CD there tend to be exposed defects. This results in a floor or valley for stochastic defect occurrence.

Another way to interpret the defect floor or valley is that the enlarged blur range at low enough probability increases the entropy significantly and damages the image across all possible printing thresholds.

With the much larger blur range at low enough probabilities (1e-9 in this example), there is significant entropy in the image and the image is damaged regardless of printing threshold. At more commonly observed probabilities (e.g., 1e-1), the image preserves its usual appearance. Note: the raw pixel images were smoothed for better visualization.

It is therefore very risky not to include dose-dependent secondary electron blur ranges in any model of EUV lithography image or defect formation.

References

[1] I. Bespalov, “Key Role of Very Low Energy Electrons in Tin-Based Molecular Resists for Extreme Ultraviolet Nanolithography,” ACS Appl. Mater. Interfaces 12, 9881 (2020).

[2] S. Grzeskowiak et al., “Measuring Secondary Electron Blur,” Proc. SPIE 10960, 1096007 (2019).

[3] D. Van Steenwinckel et al., “Lithographic Importance of Acid Diffusion in Chemically Amplified Resists,” Proc. SPIE 5753, 269 (2005).

[4] T. Brunner et al., “Impact of resist blur on MEF, OPC, and CD control,” Proc. SPIE 5377, 141 (2004).

[5] A. Narasimhan et al., “Studying secondary electron behavior in EUV resists using experimentation and modeling,” Proc. SPIE 9422, 942208 (2015).

[6] M. Kotera et al., “Extreme Ultraviolet Lithography Simulation by Tracing Photoelectron Trajectories in Resist,” Jpn. J. Appl. Phys. 47, 4944 (2008).

[7] M. Yoshii et al., “Influence of resist blur on resolution of hyper-NA immersion lithography beyond 45-nm half-pitch,” J. Micro/Nanolith. MEMS MOEMS 8, 013003 (2009).

[8] F. Chen, “EUV Stochastic Defects from Secondary Electron Blur Increasing With Dose,” https://www.youtube.com/watch?v=Q169SHHRvXE, 8/20/2023.

This article first appeared in LinkedIn Pulse: Modeling EUV Stochastic Defects With Secondary Electron Blur

Also Read:

Enhanced Stochastic Imaging in High-NA EUV Lithography

Application-Specific Lithography: Via Separation for 5nm and Beyond

ASML Update SEMICON West 2023


Arm Inches Up the Infrastructure Value Chain
by Bernard Murphy on 08-30-2023 at 6:00 am

Arm just revealed at Hot Chips their compute subsystems (CSS) direction, led by CSS N2. The intent behind CSS is to provide pre-integrated, optimized and validated subsystems to accelerate time to market for infrastructure system builders. Think HPC servers, wireless infrastructure, big edge systems for industry, city, and enterprise automation. This for me answers how Arm can add more value to system developers without becoming a chip company. They know their technology better than anyone else; by providing pre-designed, optimized and validated subsystems – cores, coherent interconnect, interrupt, memory management and I/O interfaces, together with SystemReady validation – they can chop a big chunk out of the total system development cycle.

Accelerating Custom Silicon

A completely custom design around core, interconnect, and other IPs obviously provides maximum flexibility and ability to differentiate but at a cost. That cost isn’t only in development but also in time to deployment. Time is becoming a very critical factor in fast moving markets – just look at AI and the changes it is driving in hyperscaler datacenters. I have to believe current economic uncertainties compound these concerns.

Those pressures are likely forcing an emphasis on differentiating only where essential and standardizing everywhere else, especially when proven experts can take care of a big core component. CSS provides a very standard yet configurable subsystem for many-core compute, including N2 cores (in this case), the coherent mesh network between those cores, together with interrupt and memory management, cache hierarchy, chiplet support through UCIe or custom interfaces, a DDR5/LPDDR5 external memory interface, PCIe/CXL Gen5 for fast and/or coherent IO, expansion IO, and system management.

All PPA optimized for an advanced 5nm TSMC process and proven SystemReady® with a reference software stack. The system developer still has plenty of scope for differentiation through added accelerators, specialized compute, their own power management, etc.

Neoverse V2

Arm also announced the next step in the Neoverse V-series, unsurprisingly improved over V1 with better integer performance and a reduction in system-level cache misses. There are gains on a variety of other benchmarks as well.

Also noteworthy is its performance in the NVIDIA Grace-Hopper combo (based on Neoverse V2). NVIDIA shared real hardware data with Arm on performance versus Intel Sapphire Rapids and AMD Genoa. In raw performance the Grace CPU was mostly at par with AMD and generally faster than Sapphire Rapids by 30-40%.

Most striking for me was their calculation for a datacenter limited to 5MW, important because all datacenters are ultimately power limited. In this case Grace bested AMD in performance by between 70% and 150% and was far ahead of Intel.

Net value

First on Neoverse’s contribution to Grace-Hopper – wow. That system is at the center of the tech universe right now, thanks to AI in general and large language models in particular. This is an incredible reference. Second, while I’m sure that Intel and AMD can deliver better peak performance than Arm-based systems, and Grace-Hopper workloads are somewhat specialized, (a) most workloads don’t need high end performance and (b) AI is getting into everything now. It is becoming increasingly difficult to make a case that, for cost and sustainability over a complete datacenter, Arm-based systems shouldn’t play a much bigger role especially as expense budgets tighten.

For CSS-N2, based on their own analysis Arm estimates up to 80 engineering years of effort required to develop the CSS N2 level of integration, a number that existing customers confirm is in the right ballpark. In an engineer-constrained environment, this is 80 engineering years they can drop from their program cost and schedule without compromising whatever secret differentiation they want to add around the compute core.

These look like very logical next steps for Arm in their Neoverse product line: faster performance in the V-series, and letting customers take advantage of Arm’s own experience and expertise in building N2-based compute systems while leaving open lots of room for adding their own special sauce. You can read the press release HERE.


Visit with Easy-Logic at #60DAC
by Daniel Payne on 08-29-2023 at 10:00 am


I had read a little about Easy-Logic before #60DAC, so this meeting on Wednesday in Moscone West was my first in-person meeting with Jimmy Chen and Kager Tsai to learn about their EDA tools and where they fit into the overall IC design flow. A Functional Engineering Change Order (ECO) is a way to revise an IC design by updating the smallest portion of the circuit, avoiding a complete re-design. An ECO can happen quite late in the design stage, causing project delays or even failures, so minimizing this risk and reducing the time for an ECO is an important goal, one that Easy-Logic has productized in a tool called EasylogicECO.

Easy-Logic at #60DAC

This EDA tool flow diagram shows each place where EasylogicECO fits in with logic synthesis, DFT, low power insertion, Place & Route, IC layout and tape-out.

EasylogicECO tool flow

Let’s say that your engineering team is coding RTL and finds a bug late in the design cycle. They could make an RTL change and then use the EasylogicECO tool to compare the differences between the two RTL versions, and then implement the ECO changes, where the output is an ECO netlist plus the commands to control the place-and-route tools from Cadence or Synopsys.

Another usage example for EasylogicECO is post tape-out where a bug is found or the spec changes, and then you want to do a metal-only ECO change in order to keep mask costs lower.

Easy-Logic is a 10-year-old company based in Hong Kong, and its EasylogicECO tool came out about 5-6 years ago. Most of its customers are in Asia and the names have been kept private, although there are quotes from several companies, like Sitronix, Phytium, Chipone, Loongson Technology, ASPEED and Erisedtek. Users have designed products for the cell phone, HPC, networking, AI, server, and other high-end segments.

EasylogicECO is being used mostly on advanced nodes, such as 7nm and 10nm, where design sizes can reach 5 million instances per block, and functional ECOs are used at the module and block levels. Their tool isn’t really replacing other EDA tools; rather, it fits neatly into existing EDA tool flows as shown above. Both Unix and Linux boxes run EasylogicECO, and the run times really depend on the complexity of the design changes. With a traditional methodology it could take 5 days to update a block with 5 million instances, but with the Easy-Logic approach it can take only 12 hours. This methodology aims to make the smallest patch in the shortest amount of time.

Easy-Logic works at the RTL level. After logic synthesis you basically lose the design hierarchy, which makes it hard to do an ECO. Patents have been issued for the unique approach that EasylogicECO takes by staying at the RTL level.

Engineering teams can evaluate this approach from Easy-Logic within a day or two. They’ve made the tool quite easy to use, so there’s a short learning curve: your inputs are just the original RTL, the revised RTL, the original netlist, the synthesized netlist of the revised RTL, and a library.

With 50 people in the company, you can contact an office in Hong Kong, San Jose, Beijing or Taiwan. 2023 was the first year at DAC for the company. Engineers can use this new ECO approach in four use cases:

  • Functional ECO
  • Low power ECO
  • Scan chain ECO
  • Metal ECO

Summary

SoC design is a very challenging approach to product development where time is money, and making last-minute changes like ECOs can make or break the success of a project. Easy-Logic has created a methodology to drastically shorten the time it takes for an ECO, while staying at the RTL level. I expect to see high interest in their EasylogicECO tool this year, and more customer success stories by next DAC in 2024.

Related Blogs

Key MAC Considerations for the Road to 1.6T Ethernet Success

Key MAC Considerations for the Road to 1.6T Ethernet Success
by Kalar Rajendiran on 08-29-2023 at 6:00 am

The World of Ethernet is Gigantic and Growing

Ethernet’s continual adaptation to meet the demands of a data-rich, interconnected world can be credited to the two axes along which its evolution has been propelled. The first axis emphasizes Ethernet’s role in enabling precise and reliable control over interconnected systems. As industries embrace automation and IoT, Ethernet facilitates real-time monitoring, seamless communication, and deterministic behavior, fostering a new era of industrial and infrastructure advancements. The second axis underscores Ethernet’s capacity to handle the burgeoning volumes of data generated by modern applications. From cloud computing to AI-driven analytics, Ethernet serves as the backbone for data movement, storage, and deep analysis, accelerating insights and innovation across diverse domains. The next speed milestone in Ethernet’s evolution is 1.6T, and this transformative leap requires a meticulous approach to meet the requirements along both of the above axes.

The advent of 1.6T Ethernet heralds a new era of connectivity, one where data-intensive applications will seamlessly coexist with latency-sensitive demands. Through the convergence of 224G SerDes technology, flexible and configurable MAC and PCS IP developments, and optimized silicon architectures, the networking industry can deliver solutions that not only meet but exceed the requirements of 1.6T Ethernet systems. This is the context of a Synopsys-sponsored webinar where Jon Ames and John Swanson spotlighted the focus areas of design for achieving efficiency and delivering performance.

Key Considerations for 1.6T Ethernet Success

At the heart of the Ethernet subsystem are the application and transmit/receive (Tx/Rx) queues. Application queues handle data coming from applications and services running on network-connected devices. These queues manage the flow of data into the Ethernet subsystem for transmission. The Tx/Rx queues manage the movement of packets between the Media Access Control (MAC) layer and the PHY layer for transmission and reception, respectively. Efficient queue management ensures optimal data flow and minimizes latency. Scalability, flexibility, efficient packet handling, streamlined error handling, low latency, support for emerging protocols, energy efficiency, forward error correction (FEC) optimization, security and data integrity, interoperability and compliance are all key considerations in an Ethernet subsystem.

The MAC layer is responsible for frame formatting, addressing, error handling, and flow control. It manages the transmission and reception of Ethernet frames and interacts with the PHY layer to control frame transmission timings. Timing considerations are crucial to ensure proper communication between the PHY and MAC layers, especially at high speeds.
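
As a reminder of what frame formatting and error handling mean in practice, here is a minimal, illustrative Ethernet II frame builder with its CRC-32 frame check sequence; it omits the preamble, minimum-size padding, and VLAN tags, and assumes the usual reflected CRC-32 with the FCS appended least-significant byte first:

```python
# Minimal Ethernet II frame builder with a CRC-32 frame check sequence.
# Illustrative only: no preamble/SFD, no minimum-size padding, no VLAN tags.
import zlib

def build_frame(dst_mac: bytes, src_mac: bytes, ethertype: int, payload: bytes) -> bytes:
    header = dst_mac + src_mac + ethertype.to_bytes(2, "big")
    fcs = zlib.crc32(header + payload).to_bytes(4, "little")   # appended by the MAC
    return header + payload + fcs

frame = build_frame(bytes(6 * [0xFF]),              # broadcast destination
                    bytes.fromhex("02005e000001"),  # example (locally administered) source
                    0x0800,                         # IPv4 ethertype
                    b"hello world")
print(len(frame), frame.hex())
```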

The Physical Coding Sublayer (PCS) is responsible for encoding and decoding data for transmission and reception. It interfaces between the MAC layer and the PMA/PMD layer. The PCS manages functions like data scrambling, error detection, and link synchronization. It prepares data from the MAC layer for transmission through the PMA/PMD layer.
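
For example, a 64b/66b-style PCS scrambles payload bits with a self-synchronous scrambler; the sketch below uses the commonly cited x^58 + x^39 + 1 polynomial, with bit ordering simplified for illustration:

```python
# Sketch of a self-synchronous scrambler of the kind used by 64b/66b PCS
# payloads (polynomial x^58 + x^39 + 1); bit ordering simplified here.
def scramble(bits, state=0):
    out = []
    for b in bits:
        s = b ^ ((state >> 38) & 1) ^ ((state >> 57) & 1)
        out.append(s)
        state = ((state << 1) | s) & ((1 << 58) - 1)
    return out

def descramble(bits, state=0):
    out = []
    for s in bits:
        out.append(s ^ ((state >> 38) & 1) ^ ((state >> 57) & 1))
        state = ((state << 1) | s) & ((1 << 58) - 1)
    return out

data = [1, 0, 1, 1, 0, 0, 1, 0] * 8
assert descramble(scramble(data)) == data   # round trip with matching seed state
```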

The PMA (Physical Medium Attachment), PMD (Physical Medium Dependent), and PHY (Physical Layer) collectively handle the physical transmission of data over the network medium, be it copper cables or optical fibers. The PMA/PMD layer performs functions like clock and data recovery, signal conditioning, and modulation. The PHY layer manages signal transmission, equalization, and error correction to ensure reliable data transfer at high speeds.

The synergy between cutting-edge 224G SerDes technology and the development of innovative MAC and PCS IP is poised to redefine the accessibility and scalability of 1.6T Ethernet. These components play a pivotal role in the realization of off-the-shelf solutions that seamlessly align with forthcoming 1.6T Ethernet standards. The 224G SerDes technology offers the crucial physical layer connectivity required to sustain the high data rates demanded by 1.6T Ethernet. Achieving successful communication at high data rates requires close coordination between the PHY and MAC layers, accurate timing synchronization, and the implementation of effective error correction techniques. These factors will collectively contribute to the reliability, efficiency, and performance of 1.6T Ethernet networks.

Synopsys Solutions

Synopsys MAC, PCS, and 224G SerDes IP solutions come with pre-verified and optimized designs. This means that the IP has already undergone rigorous testing and validation, reducing the need for extensive in-house verification efforts. This accelerates the development process by providing a reliable foundation to build upon. The IP solutions are designed to comply with IEEE 802.3 Ethernet standards and ensure interoperability and compatibility with a wide range of devices and network configurations. Designers can rely on the IP’s adherence to these standards, saving time that would otherwise be spent on custom protocol implementation. The solutions often come with configurability options. This enables designers to tailor the IP to their specific application requirements without having to build everything from scratch. This configurability streamlines the design process and reduces the need for extensive manual modifications.

Summary

As the race toward 1.6T Ethernet intensifies, the development of silicon solutions capable of delivering optimized power efficiency and minimal silicon footprint becomes paramount. To harness the capabilities of 1.6T Ethernet without compromising on energy consumption and design complexity, engineers must craft architectures that seamlessly merge efficiency with innovation. This involves meticulous digital design, ensuring that the intricate interaction between hardware components and software layers is harmonious, thereby producing networking solutions that are both efficient and robust and help accelerate first pass silicon success.

For more details, visit the Synopsys Ethernet IP Solutions page.

You can watch the entire webinar on-demand from here.

Also Read:

WEBINAR: Why Rigorous Testing is So Important for PCI Express 6.0

Next-Gen AI Engine for Intelligent Vision Applications

VC Formal Enabled QED Proofs on a RISC-V Core


Systematic RISC-V architecture analysis and optimization
by Don Dingee on 08-28-2023 at 10:00 am


The RISC-V movement has taken off so quickly because of the wide range of choices it offers designers. However, massive flexibility creates its own challenges. One is how to analyze, optimize, and verify an unproven RISC-V core design with potential microarchitecture changes allowed within the bounds of the specification. S2C, best known for its FPGA-based prototyping technology, gave an update at #60DAC on its emerging systematic RISC-V architecture analysis and optimization strategy, adding modeling and emulation capability.

Three phases to RISC-V architecture analysis

RISC-V differs from other processor architectures in how much customization is possible – from execution unit and pipeline configurations all the way to adding customized instructions. Developers are exploring the best fits of various RISC-V configurations in many applications, where some definitions are still ambiguous. EDA support has yet to catch up; basic tools exist, but few advanced modeling platforms are available.

These conditions leave teams with a problem: if they extend the RISC-V instruction set for their implementation, they must create new cycle-accurate models for those instructions before assessing performance, simulated or emulated. S2C is working to fill this void with a complete chain for systematic RISC-V architecture analysis and optimization featuring one familiar technology flanked by two others.

First in the chain is S2C’s new RISC-V “core master” model abstraction platform, Genesis. It provides stochastic modeling, system architecture modeling, and cycle-accurate modeling, with increasing levels of accuracy as models add fidelity. Genesis allows the simulation of commercially available RISC-V cores as IP modules, then lets designers update parameters or add custom logic to the microarchitecture. These simulations enable earlier optimization of cores.

Holding the middle of the analysis chain is the S2C Prodigy prototyping family, facilitating FPGA-based prototypes for hardware logic debugging, basic performance assessment, and early software development. Prodigy prototyping hardware also accepts off-the-shelf I/O modules developed by S2C for stimulus and consumption of real-world signals around the periphery of the SoC, as well as RISC-V IP performance verification.

 

New emulation capability comes with S2C’s OmniArk hybrid emulation system, capable of hyper-scale verification of RISC-V SoCs. OmniArk specializes in compiling automotive SoCs and boasts powerful debugging capabilities for an efficient verification environment. It scales up to 1 billion gates for large designs and supports verification modes like QEMU, TBA, and ICE.

An example: collaboration on the XiangShan RISC-V core project

Accurate behavioral models of RISC-V cores carry through early modeling, FPGA-based prototyping, and hardware emulation processes. Giving designers better control of both IP and models enables tasks once only possible in hardware prototypes to shift into virtual analysis activities earlier in the design cycle, creating more opportunities for optimization.

An example of systematic RISC-V architecture analysis and optimization is in S2C’s collaboration with the XiangShan project team based at the Chinese Academy of Sciences. XiangShan is a superscalar, six-wide, out-of-order RISC-V implementation targeting a Linux variant for its operating system.

The XiangShan team used S2C products to create a core verification platform integrated with an external GPU and other peripherals. The hyperscale core is partitioned across an S2C FPGA-based prototyping platform, with peripherals added via PCIe and other interfaces.

“As RISC-V technology has penetrated various fields, its open-source, conciseness, and high scalability are redefining the future of computing,” says Ying J. Chen, Vice President at S2C. “S2C’s three major product lines can provide various solutions like software performance evaluation for microarchitecture analysis, system integration, and specification compliance testing based on RISC-V.”

We expect more details soon from S2C on how the systematic RISC-V architecture analysis and optimization chain comes together with upcoming US product announcements – for now, S2C’s Chinese language site has some information on Genesis. More details on the XiangShan RISC-V project are available from tutorials given at ASPLOS’23.

Also Read:

Sirius Wireless Partners with S2C on Wi-Fi6/BT RF IP Verification System for Finer Chip Design

S2C Accelerates Development Timeline of Bluetooth LE Audio SoC

S2C Helps Client to Achieve High-Performance Secure GPU Chip Verification


AMD Puts Synopsys AI Verification Tools to the Test
by Mike Gianfagna on 08-28-2023 at 6:00 am


The various algorithms that comprise artificial intelligence (AI) are finding their way into the chip design flow. What is driving a lot of this work is the complexity explosion of new chip designs required to accelerate advanced AI algorithms. It turns out AI is both the problem and the solution in this case. AI can be used to cut the AI chip design problem down to size. Synopsys has been developing AI-assisted design capabilities for quite a while, beginning with the release of a design space optimization capability (DSO.ai) in 2020. Since then, several new capabilities have been announced, significantly expanding its AI-assisted footprint. You can get a good overview of what Synopsys is working on here. One of the capabilities in the Synopsys portfolio focuses on verification space optimization (VSO.ai). The real test of any new capability is its use by a real customer on a real design, and that is the topic of this post. Read on to see how AMD puts Synopsys AI verification tools to the test.

VSO.ai – What it Does

Test coverage of a design is the core issue in semiconductor verification. The battle cry is, “if you haven’t exercised it, you haven’t verified it.” Stimulus vectors are generated using a variety of techniques, with constrained random being a popular approach. Those vectors are then used in simulation runs on the design, looking for test results that don’t match expected results.

By exercising more of the circuit, the chance of finding functional design flaws is increased.

Verification teams choose structural code coverage metrics (line, expression, block, etc.) of interest and automatically add them to simulation runs. As each test iteration generates constrained-random stimulus conforming to the rules, the simulator collects metrics for all the forms of coverage included. The results are monitored, with the goal of tweaking the constraints to try to improve the coverage. At some point, the team decides that they have done the best that they can within the schedule and resource constraints of the project, and they tape out.

Code coverage does not reflect the intended functionality of the design, so user-defined coverage is important. This is typically a manual effort, spanning only a limited percentage of the design’s behavior. Closing coverage and achieving verification goals is quite difficult.

A typical chip project runs many thousands of constrained-random simulation tests with a great deal of repetitive activity in the design. So, the rate of new coverage slows, and the benefit of each new test reduces over time.
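
The flattening is easy to reproduce with a toy model: if each random test hits a fixed number of uniformly chosen coverage bins, the number of newly covered bins per test collapses quickly. The numbers below are arbitrary:

```python
# Toy model of diminishing coverage returns from constrained-random tests.
# Bin counts and hits-per-test are arbitrary, for illustration only.
import random
random.seed(1)

total_bins, hits_per_test, num_tests = 10_000, 200, 500
covered, new_per_test = set(), []
for _ in range(num_tests):
    hit = set(random.sample(range(total_bins), hits_per_test))
    new_per_test.append(len(hit - covered))
    covered |= hit

# New coverage per test collapses quickly: roughly 200 at first, ~25 by
# test 100, and near zero by test 500 in this toy setup.
print(new_per_test[0], new_per_test[99], new_per_test[-1])
```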

At some point, the curve flattens out, often before goals are met. The team must try to figure out what is going on and improve coverage as much as possible within time and resource constraints. This “last mile” of the process is quite challenging. The amount of data collected is overwhelming, and trying to analyze it to determine the root cause of a coverage hole is difficult and labor-intensive. Is it an illegal bin for this configuration or a true coverage hole?

The design of complex chips contains many problems that look like this – the requirement to analyze vast amounts of data and identify the best path forward. The good news is that AI techniques can be applied to this class of problems quite successfully.

For coverage definition, Synopsys VSO.ai infers some types of coverage beyond traditional code coverage to complement user-specified coverage. Machine learning (ML) can learn from experience and intelligently reuse coverage when appropriate. Even during a single project, learnings from earlier coverage results can help to improve coverage models.

VSO.ai works at the coarse-grained test level and provides automated, adaptive test optimization that learns as the results change. Running the tests with highest ROI first while eliminating redundant tests accelerates coverage closure and saves compute resources.
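
Conceptually, running the highest-ROI tests first can be pictured as a greedy ordering over per-test coverage data, as in the sketch below; this is only an illustration of the scheduling idea, not VSO.ai’s actual algorithm:

```python
# Conceptual greedy "highest ROI first" ordering over per-test coverage sets.
# Not VSO.ai's algorithm -- just an illustration of the scheduling idea.
def order_tests(test_coverage):
    remaining, covered, order = dict(test_coverage), set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:      # everything left is redundant
            break
        covered |= remaining.pop(best)
        order.append(best)
    return order

print(order_tests({"t1": {"a", "b"}, "t2": {"b"}, "t3": {"c", "d", "e"}}))
# -> ['t3', 't1']  (t2 adds no new bins and is dropped)
```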

The tool also works at the fine-grained level within the simulator to improve the test quality of results (QoR) by adapting the constrained-random stimulus to better target unexercised coverage points. This not only accelerates coverage closure, but also drives convergence to a higher percentage value.

The last mile closure challenge is addressed by automated, AI-driven analysis of coverage results. VSO.ai performs root cause analysis (RCA) to determine why specific coverage points are not being reached. If the tool can resolve the situation itself, it will. Otherwise, it presents the team with actionable results, such as identifying conflicting constraints.

The figure below summarizes the benefits VSO.ai can deliver. A top-level benefit of these approaches is the achievement of superior results in less time with less designer effort. We will revisit this statement in a moment.

The Benefits of VSO.ai

What AMD Found

At the recent Synopsys Users Group (SNUG) held in Silicon Valley, AMD presented a paper entitled, “Drop the Blindfold: Coverage-Regression Optimization in Constrained-Random Simulations using VSO.ai (Verification Space Optimization).”  The paper detailed AMD’s experiences using VSO.ai on several designs. AMD had substantial goals and expectations for this work:

Reach 100% coverage consistently with small RTL changes and design variants, but in an optimized, automated way.

AMD applied a well-documented methodology using VSO.ai across regression samples for four different designs. The figure below summarizes these four experiments.

Regression Characteristics Across Four Designs

AMD then presented a detailed overview of these designs, their challenges and the results achieved by using VSO.ai, compared to the original effort without VSO.ai. Recall one of the hallmark benefits of applying AI to the design process:

Achievement of superior results in less time with less designer effort

In its SNUG presentation, awarded one of the Top 10 Best Presentations at the event, AMD summarized the observed benefits as follows:

  • 1.5 – 16X reduction in the number of tests being run across the four designs to achieve the same coverage
  • Quick, on-demand regression qualifier
    • Can be used to gauge how good the test distribution of a regression is, if the user is uncertain about the iterations needed
  • Potentially target more bins under same budget
    • If default regression(s) do not achieve 100% coverage, VSO.ai can potentially exceed this (i.e., experiment #1)
  • Testcase(s) removal in coverage regressions if not contributing
  • More reliable test grading for constrained random tests
    • URG (Unified Report Generator): seed-based
    • VSO.ai: probability-based
  • Debug
    • Uncover coverage items that have a lower probability of being hit than expected

This presentation put VSO.ai to the test and the positive impact of the tool was documented.  As mentioned, this kind of user application to real designs is the real test of a new technology. And that’s how AMD puts Synopsys AI verification tools to the test.

Also Read:

WEBINAR: Why Rigorous Testing is So Important for PCI Express 6.0

Next-Gen AI Engine for Intelligent Vision Applications

VC Formal Enabled QED Proofs on a RISC-V Core


Podcast EP178: An Overview of Advanced Power Optimization at Synopsys with William Ruby
by Daniel Nenni on 08-25-2023 at 10:00 am

Dan is joined by William Ruby, director of product management for Synopsys Power Analysis products. He has extensive experience in the area of low-power IC design and design methodology, and has held senior engineering and product marketing positions with Cadence, ANSYS, Intel, and Siemens. He also has a patent in high-speed cache memory design.

Dan explores new approaches to power analysis and power optimization with William, who explains strategies for increasing accuracy of early power analysis, when there is more opportunity to optimize the design. Enhanced modeling techniques and new approaches to computing power are discussed. The benefits of emulation for workload-based power analysis are also explored.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


The First TSMC CEO James E. Dykes
by Daniel Nenni on 08-25-2023 at 6:00 am


Most people (including ChatGPT) think Morris Chang was the first TSMC CEO, but it was in fact Jim Dykes, a very interesting character in the semiconductor industry.

According to his eulogy: Jim came from the humblest of beginnings, easily sharing that he grew up in a house without running water and never had a bed of his own. But because of his own drive, coupled with compassion, leadership, and intelligence, he was indeed a genuine “success story.” He was honored in his profession with awards too numerous to list. During his long career he held leadership positions in several companies, including Radiation, Harris, General Electric, Philips North America and TSMC in Taiwan. His work took him to locales in Florida, California, North Carolina and Texas as well as overseas, but he returned to his Florida roots to retire, living both in Fort McCoy and St. Augustine.

Jim was known around the semiconductor industry as a friendly, funny, approachable person. I did not know him but some of my inner circle did. According to semiconductor lore, Jim Dykes was forced on Morris Chang by the TSMC Board of Directors due to his GE Semiconductor experience and Philips connections. Unfortunately, Jim and Morris were polar opposites and didn’t get along. Jim left TSMC inside the two-year mark and was replaced by Morris himself. Morris didn’t like Philips looking over his shoulder and stated that the TSMC CEO must be Taiwanese, and he was not wrong in my opinion. Morris then hired Don Brooks as President of TSMC. I will write more about Don Brooks next because he had a lasting influence on TSMC that is not generally known.

One thing Jim left behind that is searchable is industry presentations. My good friend and co-author Paul McLellan covered Jim’s “Four Little Dragons of the Orient and an Emerging Role Model for Semiconductor Companies” presentation quite nicely HERE. This presentation was made in January of 1988 while Jim was just starting as CEO of TSMC. I have a PDF copy in case you are interested.

“I maintain we are no less than a precursor of an entirely new way of doing business in semiconductors. We are a value-added manufacturer with a unique charter… We can have no designs or product of our own. T-S-M-C was established to bridge the gap between what our customers can design and what they can market.”

“We consider ourselves to be a strategic manufacturing resource, not an opportunistic one. We exist because today’s semiconductor companies and users need a manufacturing partner they can trust and our approach, where we and our customers in effect spread costs among many users, yet achieve the economics each seeks, makes it a win-win for everyone.”

So from the very beginning TSMC’s goal was to be the Trusted Foundry Partner which still stands today. From the current TSMC vision and mission statement:

“Our mission is to be the trusted technology and capacity provider of the global logic IC industry for years to come.”

Another interesting Jim Dykes presentation “TSMC Outlook May 1988” is on SemiWiki. It is more about Taiwan than TSMC but interesting just the same.

“Taiwan, by comparison, is more like Silicon Valley. You find in Taiwan the same entrepreneurial spirit the same willingness to trade hard work for business success and the opportunities to make it happen, that you find in Santa Clara County, and here in the Valley of the Sun. Even Taiwan’s version of Wall Street will seem familiar to many of you. There’s a red-hot stock market where an entrepreneur can take a company public and become rich overnight.”

I agree with this statement 100% and experienced it first hand in the 1990s through today, absolutely.

I was also able to dig up a Jim Dykes presentation “TO BE OR NOT TO BE” from 1982 when he was VP of the Semiconductor Division at GE. In this paper Jim talks about the pros and cons of being a captive semiconductor manufacturer. Captive is what we now call fabless systems companies – companies that make their own chips for the complete systems they sell (Apple). Remember, at the time, computer system companies were driving the semiconductor industry and had their own fabs: IBM, HP, DEC, DG, etc… so we have come full circle with systems companies making their own chips again.

Speaking of DG (Data General), I read Soul of a New Machine by Tracy Kidder during my undergraduate studies and absolutely fell in love with the technology. In fact, after graduating, I went to work for DG which was featured in the book.

I have a PDF copy of Jim’s “TO BE OR NOT TO BE” presentation in case you are interested.

Also read:

How Philips Saved TSMC

Morris Chang’s Journey to Taiwan and TSMC

How Taiwan Saved the Semiconductor Industry