
IEDM 2023 – 2D Materials – Intel and TSMC

by Scotten Jones on 02-20-2023 at 6:00 am


Intel and TSMC are two of the three leading edge logic companies. At IEDM, held in December 2022, Intel presented one paper on 2D materials and TSMC presented six. Clearly, 2D materials are of great interest to at least two of the three leading edge logic companies. Before diving into the papers, some background context is needed.

Logic Scaling

Logic designs are made up of standard cells; if you are going to scale logic to increase density, the standard cells must shrink.

The height of a standard cell is typically characterized as the Metal-2 Pitch (M2P) multiplied by the number of tracks. While this is a useful metric, it glosses over the fact that the cell height must also encompass the devices that make up the cell. Figure 1 illustrates a 7.5-track standard cell, showing the M2P and tracks on the left of the cell and, to the right, a cross-sectional view of the corresponding device structure.

The width of a standard cell is made up of some number of Contacted Poly Pitches (CPP), with the number depending on the cell type and how the diffusion breaks at the edges of the cell are handled. Once again, CPP encompasses a device structure that must shrink when CPP shrinks. Figure 1 illustrates CPP and, at the bottom, shows a cross-sectional view of the device structure.

Figure 1. Standard Cell.
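As a rough sketch of the arithmetic, cell area is just (M2P x tracks) x (CPP x width in CPPs). The numbers below are illustrative assumptions, not values from the article:

```python
# Illustrative cell-size arithmetic; M2P, CPP, and cell width in CPPs are
# assumed example values, not figures from the article.
M2P_NM = 32          # Metal-2 pitch, nm (assumed)
TRACKS = 7.5         # track count, as in Figure 1
CPP_NM = 51          # contacted poly pitch, nm (assumed)
WIDTH_CPPS = 2       # cell width in CPPs (depends on cell type)

cell_height_nm = M2P_NM * TRACKS
cell_width_nm = CPP_NM * WIDTH_CPPS
cell_area_um2 = cell_height_nm * cell_width_nm / 1e6
print(f"height {cell_height_nm:.0f} nm x width {cell_width_nm} nm "
      f"= {cell_area_um2:.4f} um^2 per cell")
```

Shrinking either M2P or CPP shrinks cell area directly, which is why both dimensions are attacked at every node.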

Intel, Samsung, and TSMC have all made the switch from planar devices to FinFETs and are now in the beginning stages of the transition to Horizontal Nano-Sheets (HNS). Samsung is in production with HNS now, and Intel and TSMC have announced HNS production targets of 2024 and 2025, respectively.

Figure 2 illustrates the device structure and dimensions that make up cell height.

Figure 2. Standard Cell Height.

The changeover to HNS offers multiple opportunities to shrink cell height. HNS can replace multiple-fin nFET and pFET devices with single nano-sheet stacks, shrinking the height impact of the devices. Forksheet and CFET enhancements to HNS can reduce or even eliminate n-p spacing.

CPP is made up of Gate Length (Lg), Spacer Thickness (Tsp), and Contact Width (Wc); see figure 3.

Figure 3. Contacted Poly Pitch.
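A common first-order decomposition (an assumption here, consistent with figure 3) is CPP = Lg + 2·Tsp + Wc. A small sketch with illustrative numbers shows how much a ~5nm Lg alone buys:

```python
def cpp_nm(lg_nm: float, tsp_nm: float, wc_nm: float) -> float:
    """Contacted poly pitch, assuming CPP = Lg + 2*Tsp + Wc (figure 3)."""
    return lg_nm + 2 * tsp_nm + wc_nm

# Illustrative values (assumed, not from the article)
today = cpp_nm(lg_nm=16, tsp_nm=6, wc_nm=15)    # 43 nm
scaled = cpp_nm(lg_nm=5, tsp_nm=6, wc_nm=15)    # 32 nm with ~5 nm Lg
print(f"CPP from Lg scaling alone: {today:.0f} nm -> {scaled:.0f} nm "
      f"({100 * (1 - scaled / today):.0f}% smaller)")
```

In practice Tsp and Wc must shrink too, but the sketch shows why Lg is the most leveraged term.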

CPP can be scaled down by reducing Lg, Tsp, or Wc, or any combination of the three. Lg is limited by the device's ability to provide acceptable leakage. Figure 4 illustrates Lg for various devices.

Figure 4. Gate Length Scaling.

As figure 4 shows, constraining the channel thickness and/or increasing the number of gates enables shorter Lg.
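One way to see why is the textbook electrostatic scale length for a multi-gate MOSFET, lambda = sqrt((eps_ch/eps_ox)·t_ch·t_ox/N). This formula and the values below are textbook approximations and assumptions, not taken from any of the papers:

```python
import math

def scale_length_nm(t_ch_nm: float, t_ox_nm: float,
                    eps_ch: float = 11.7, eps_ox: float = 3.9,
                    n_gates: int = 2) -> float:
    """Electrostatic scale length lambda = sqrt((eps_ch/eps_ox) * t_ch *
    t_ox / N) for an N-gate device. Textbook approximation; the
    permittivities here are silicon/SiO2 values, an assumption."""
    return math.sqrt((eps_ch / eps_ox) * t_ch_nm * t_ox_nm / n_gates)

# Double-gate monolayer channel (~0.65 nm thick) with ~1 nm EOT (assumed)
lam = scale_length_nm(t_ch_nm=0.65, t_ox_nm=1.0)
print(f"lambda ~ {lam:.2f} nm; Lg ~ 5*lambda ~ {5 * lam:.1f} nm")
```

With a sub-nanometer channel, a rule of thumb of Lg ~ 5·lambda lands right around the ~5nm figure discussed below.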

So-called 2D materials are monolayers less than 1nm thick, improving gate control over the channel and enabling Lg down to ~5nm. At these dimensions silicon has poor mobility, so other materials with higher mobility and a wider band gap are used, further reducing leakage. The ability to scale Lg to ~5nm enables a significant shrink of CPP and therefore smaller standard cells.

2D Material Challenges

Transition Metal Dichalcogenides (TMDs) such as MoS2, WS2, or WSe2 have been identified as materials of interest with high mobility at monolayer thicknesses (silicon has poor mobility at these dimensions). There are several challenges/questions that need to be addressed for practical use of these materials, and they are explored in the seven papers discussed below:

  1. Device performance – do devices fabricated with these materials really provide good drive current and low leakage at short Lg?
  2. Contacts – 2D TMD films are atomically smooth, making it hard to form good, low-resistance contacts.
  3. Film formation – currently, MOCVD at high temperature on a sapphire substrate is used to form the 2D films, and the resulting film is then transferred to a 300mm silicon wafer for further processing. This is not a practical production process.

Presented Results

In paper 7.5, “Gate length scaling beyond Si: Mono-layer 2D Channel FETs Robust to Short Channel Effects,” C. J. Dorow, et al., of Intel explored device performance.

The ultimate goal for 2D-material-based devices is a stack of 2D layers similar to HNS stacks, but with each channel thinner, enabling shorter Lg and more layers in a stack. Figure 5 illustrates the difference.

Figure 5. HNS Versus 2D Stack.

Intel did a wet transfer of an MBE-grown MoS2 film over a back gate and then evaluated the device with the back gate alone and with an added front gate, down to a source-drain distance of 25nm. Figure 6 illustrates the device structure.

Figure 6. Intel 2D Device Structure.

Intel encountered some delamination issues in their experiments but was able to experimentally confirm its modeling results and conclude that a double-gated device should be able to scale down to at least 10nm with low leakage; see figure 7.

Figure 7. Experimental Results (left side) and Simulation Results (right side).

In paper 28.4, “Comprehensive Physics Based TCAD Model for 2D MX2 Channel Transistors,” D. Mahaveer Sathaiya, et al., of TSMC discussed a comprehensive simulation model of 2D devices and calibrated it against three datasets. The ability to model 2D devices accurately will be key to the further development of the technology.

In paper 28.1, “Computational Screening and Multiscale Simulation of Barrier-Free Contacts for 2D Semiconductor pFETs,” Ning Yang, et al., of TSMC used ab initio calculations to screen contact materials for 2D devices.

The best reported experimental contact resistance to WSe2 is 950 Ω·μm; in this work, Co3Sn2S2 is projected to achieve 20 Ω·μm, approaching the quantum limit. Furthermore, simulated devices are projected to produce ~2 mA/μm on-state current. Sputtering on a sapphire substrate followed by a high-temperature (800°C) anneal was shown to produce Co3Sn2S2 with the expected chemical composition and crystalline structure.

In paper 7.2, “High-Performance Monolayer WSe2 p/n FETs via Antimony-Platinum Modulated Contact Technology towards 2D CMOS Electronics,” Ang-Sheng Chou, et al., of TSMC presented experimental results for Sb-Pt modulated contacts that achieve record contact resistances of 750 Ω·μm for pFETs and 1,800 Ω·μm for nFETs on WSe2. An on-current of ~150 μA/μm was achieved. These results are not as good as the projections from paper 28.1, but they are experimental results rather than simulations.

In paper 7.3, “pMOSFET with CVD-grown 2D semiconductor channel enabled by ultra-thin and fab-compatible spacer doping,” Terry Y.T. Hung, et al., of TSMC presented work towards a production-type pFET. A lot of 2D material work is done on Schottky diodes, but MOSFETs have lower access resistance. To create practical MOSFETs, a CVD-grown channel with doped spacers is needed. In this paper, broken-bandgap doped spacers are created by treating WSe2 with O2 plasma to form WOx as a dopant. The process is self-aligned and self-limiting, as illustrated in figure 8.

Figure 8. Self-aligned and Self-limited doped spacer formation.

The CVD-grown 2D layers are still grown separately and then transferred, but other parts of the process are production compatible. The devices achieved one of the lowest contact resistances (Rc ~ 1,000 Ω·μm) among transistors with WSe2 channels, with a relatively high Ion > 10⁻⁵ A/μm and a good S.S. < 80 mV/dec.

In paper 7.4, “Nearly Ideal Subthreshold Swing in Monolayer MoS2 Top-Gate nFETs with Scaled EOT of 1 nm,” Tsung-En Lee, et al., of TSMC showed an ALD-grown Hf-based gate oxide of ~1nm EOT on CVD-grown MoS2 with a top gate, achieving low leakage and a nearly ideal subthreshold swing of 68 mV/dec. Pinhole-free oxides on TMD materials are very difficult to achieve, and this work showed excellent results.
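For context, the ideal room-temperature subthreshold swing is ln(10)·kT/q ≈ 59.5 mV/dec, so 68 mV/dec corresponds to a body factor of about 1.14. A quick sanity check (standard device physics, not a calculation from the paper):

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
Q_E = 1.602176634e-19    # elementary charge, C

def ss_mv_per_dec(temp_k: float = 300.0, body_factor: float = 1.0) -> float:
    """Subthreshold swing SS = m * ln(10) * kT/q, in mV/dec (m = 1 is ideal)."""
    return body_factor * math.log(10) * K_B * temp_k / Q_E * 1e3

ideal = ss_mv_per_dec()          # ~59.5 mV/dec at 300 K
reported = 68.0                  # the value reported in paper 7.4
print(f"ideal SS: {ideal:.1f} mV/dec, "
      f"implied body factor: {reported / ideal:.2f}")
```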

The final paper, 34.5, “First Demonstration of GAA Monolayer-MoS2 Nanosheet nFET with 410 μA/μm ID at 1V VD at 40nm gate length,” by Yun-Yan Chung, et al., of TSMC, showed an MoS2 device with good performance fabricated with an integrated process flow.

Figure 9 illustrates a simulation of the process flow for a two-layer device stack.

Figure 9. Simulated two layer device process.

Although further research is still needed, in this paper stacks of two and four pairs of TMD and sacrificial material were sequentially deposited.

Figure 10 shows TEM images of the resulting stacks.

Figure 10. TEM of the deposited 2D/sacrificial material stacks.

Sequential deposition of the 2D materials and sacrificial layers is a far more production-worthy process than film transfer and is likely to be lower cost as well.

The resulting stacks were then etched into fins using a metal hard mask. Figure 11 illustrates the “fin” formation results.

Figure 11. 2D Material stack fins.

As is the case with horizontal nanosheet stacks, an inner spacer is needed to reduce capacitance. To form the inner spacer, an additional sacrificial material is needed to prevent collapse of the 2D layers. Figure 12 illustrates the inner spacer process.

Figure 12. Inner Spacer formation.

Finally, metal edge contacts are formed, and the channels are released. Figure 13 illustrates the metal edge contacts.

Figure 13. Metal Edge Contacts.

The resulting devices have high contact resistance due to the lack of doping in the contact and extension regions. A one-layer device with 40nm Lg was demonstrated with a Vth of ~0.8 volts, SS of ~250 mV/dec, and drive current of 410 μA/μm.

Conclusion

These seven papers illustrate both the excellent progress being made toward 2D devices and the level of interest at two of the leading-edge device producers. Some recent projections I have completed suggest that 2D CFETs can achieve 5x the logic density of the current densest production standard cells. 2D CFETs are likely a technology for the 2030s rather than the 2020s, and they illustrate that logic scaling is nowhere near its end.

Also Read:

IEDM 2022 – Imec 4 Track Cell

IEDM 2022 – TSMC 3nm

IEDM 2022 – Ann Kelleher of Intel – Plenary Talk


Podcast EP144: How Andes Supplies RISC-V Cores to the World with Frankwell Lin

by Daniel Nenni on 02-17-2023 at 10:00 am

Dan is joined by Frankwell Lin. Frank co-founded Andes Technology in 2005 and served as President from 2006. He became Chairman and CEO in 2021. Under his leadership, Andes is recognized as a top supplier of embedded CPU IP in the semiconductor industry.

Dan explores how Andes became such a strong supplier of RISC-V cores with Frank. Frank explains how Andes chose the RISC-V architecture and the vast array of applications that Andes supports with high quality, proven IP. Dan discusses the future with Frank as well. Where will Andes take its portfolio and expertise next?

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


CEO Interview: Axel Kloth of Abacus

by Daniel Nenni on 02-17-2023 at 6:00 am


A physicist by training, Axel is used to the need for large-scale compute. He discovered over 30 years ago that scalability of processor performance was paramount for solving any computational problem. That necessitated a new paradigm in computer architecture. At Parimics, SSRLabs and Axiado he was able to show that new thinking was needed, and what novel practical solutions could look like. Axel is now repeating that approach with Abacus Semi.

What is Abacus Semiconductor Corporation’s vision?
Abacus Semi envisions a future in which supercomputers can be built with Lego-like building blocks – mix and match any combination of processors, accelerators, and smart multi-homed memories. We believe that supercomputers today do not fulfill the requirements of the users. They do not scale nearly linearly. Oftentimes, 100,000 servers making up a supercomputer can be found to provide just 5,000 times the performance of a single server. That is largely because today’s supercomputers are in essence commercial off-the-shelf (COTS) devices, without any consideration of communication between those servers for instruction- and data-sharing at low latency and high bandwidth. Another drawback is that accelerators for special-purpose applications do not integrate easily into supercomputers. We have a different view on the basic building blocks – very similar to Legos. The programmable elements such as processors are used for orchestration of the workload, accelerators carry out the work, data comes in and exits through dedicated I/O nodes, and large-scale smart multi-homed memory subsystems keep the intermediate data at hand at all times.

How did Abacus Semiconductor Corporation begin?
Axel is a physicist and computer scientist by training, and as such has used supercomputers for decades. He was frustrated by the complexity of deploying and using them, by the lack of linear scaling, and by the enormous cost associated with them. As a result, he set out to fix what could be fixed, always assuming a few basic fundamentals. He started on this journey with Parimics, a vision processor company, in 2004, then with Scalable Systems Research Labs, Inc. (SSRLabs) in 2011, with a short detour to a secure processor startup, and now with Abacus Semiconductor Corporation, founded in 2020.

A modern supercomputer should allow easy integration of accelerators in both hardware and software, it should be able to provide very large memory configurations in both exclusive and shared memory partitions, and it should be on par in cost with COTS-based systems while keeping operating costs down. In particular, the integration of accelerators for numerically intensive applications – matrix and tensor math, Artificial Intelligence (AI), and Machine Learning (ML) – as well as very large cache-coherent memory shared across many processors, proves to be a good and future-proof call, as today’s requirements for GPT-3 and ChatGPT call for memory arrays of sizes that are not supported by today’s processors.

As a computer scientist, Axel was clear that fixed-function devices provide vastly superior performance and use less power and silicon real estate than programmable elements. A modern supercomputer should therefore allow for the integration of all kinds of accelerators while keeping the programmability of a processor at hand for orchestrating workloads and executing those tasks for which no hardware exists.

You mentioned you have some recent developments to share. What are they?
We are very excited to let you know that we have assessed all of the code and the building blocks we have created over the past decade and more, and our requirements are all met. With our Server-on-a-Chip, our smart multi-homed memory subsystems, and our math and database accelerators, we have shown in simulations that we will achieve vastly better linearity of scale-out. For most applications and configurations, it appears we will hit an 80% scale-out factor, i.e., a supercomputer consisting of 100,000 servers should provide roughly 80,000 times the performance of a single one. Our interface will provide enough bandwidth per pin to allow for over 3.2 TB/s of bandwidth into and out of our accelerators and processors. The smart multi-homed memory subsystem will provide nearly 1 TB/s of bandwidth into and out of the chip. The security and coherency domains can be set for each memory subsystem. We have made progress in building our team – both engineering and management – and we have a term sheet in hand. We are still assessing the validity and veracity of this term sheet, but at this point in time the conditions look good.
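The scale-out arithmetic quoted in this answer can be sketched with a constant-efficiency model (a simplification; real scaling efficiency varies with workload and cluster size):

```python
def effective_servers(n_servers: int, scale_out_factor: float) -> float:
    """Aggregate performance in units of a single server, assuming a
    constant scale-out factor (a simplification of the interview's claims)."""
    return n_servers * scale_out_factor

cots = effective_servers(100_000, 0.05)     # ~5,000x, as cited for COTS
abacus = effective_servers(100_000, 0.80)   # ~80,000x, the simulated target
print(f"COTS: {cots:,.0f}x vs. target: {abacus:,.0f}x "
      f"({abacus / cots:.0f}x more effective throughput)")
```

At the same server count, an 80% scale-out factor versus 5% is a 16x difference in delivered performance.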

Tell us about these new chips you are building?
As stated before, we believe that in order to build a new generation of supercomputers, new processors, accelerators, and smart multi-homed memories are needed. We also touched on the fact that today’s cores are incredibly good, and that the problem in supercomputers is not the processor cores but nearly everything around them. We are using RISC-V processor cores that we modified as the basic programmable building element. Doing that allows us to partake in the growth of the ecosystem around RISC-V, which I believe shows the fastest growth of any processor that I have seen in my career. We removed all of the performance-limiting factors around RISC-V, added hardware support for virtualization and hypervisors, optimized the cache interfaces, and made sure it can connect to our internal processor information superhighway. We are also using accelerators for all I/O and legacy interfaces, and because we do this in a Lego-like fashion, these blocks are reused in our Server-on-a-Chip and in our integer database processor and orchestration processor, which are in fact the same hardware with different firmware. The Lego-like principles extend to our smart multi-homed memory subsystem as well. As such, our development effort is relatively low compared to other companies that focus on processor design and supercomputers. Due to our philosophy of parallelism instead of cranking up clock frequencies, we do not need to spend tons of money on the old cat-and-mouse game of physical design, with dynamic timing closure going through multiple iterative rounds to squeeze out one more Hertz of clock frequency. All of that simplifies code and building-block reuse, and that is why we try to build our own IP in-house and keep it that way.

What are the chips in the Abacus Semi family?
The chips we are designing are the Server-on-a-Chip, which effectively combines an entire server onto one processor; the identical Supercomputer I/O Frontend; an Orchestration Processor and an Integer Database Processor (both deploy the same hardware but use different firmware); and a math accelerator, as well as a set of smart multi-homed memories.

How are the Abacus Semi chips programmed?
Since we use a RISC-V processor as the underlying programmable element, we can call on the existing ecosystem. Our Server-on-a-Chip, the integer database processor and the orchestration processor are all fully RISC-V Instruction Set Architecture compatible. In other words, they all run Linux and FreeBSD, with GCC and LLVM/CLANG as compilers available for a while now. In fact, the entire LAMP (Linux/Apache/mySQL/PHP) and FAMP (FreeBSD/Apache/mySQL/PHP) stack is available for them, and as such, any PHP and Perl application runs on them unchanged. Due to the fact that we use a DPU-plus approach to networking, we have a piece of firmware available for our processors that acts like a filtering Network Interface Card (NIC) with offload capabilities and with DMA and Remote DMA functions, as well as with direct memory access to the applications processors. A similar offload for mass storage is available and offloads the applications processors from mass storage tasks, thereby making more of the applications processors’ time available for the user applications, with or without a hypervisor. Since the Server-on-a-Chip doubles as an I/O frontend for supercomputers, the supercomputer core does not need to carry out I/O or legacy interface functions; these are all relegated to the Server-on-a-Chip. That allows the users of a supercomputer to deploy the core in a bare-metal fashion, if so desired. The math accelerator for matrix and tensor math as well as for transforms uses openACC and openCL as outward-facing APIs, but we have a translation layer available that converts CUDA into our native command set.

Can you tell us more about your technology behind the scale-out improvement?
We believe that communication is key in scale-out – more specifically, low-latency and high-bandwidth communication. To that end, we reviewed everything we had built for unnecessary layers of communication hierarchy through bridges, interface adapters, and interface converters. We removed as many of them as possible. As a result, the communication between any two or more elements in our architecture provides the highest possible bandwidth given the restrictions in bump and ball count and the need to traverse Printed Circuit Boards (PCBs), which necessitates CML-type High Speed Serial Links. However, we use the shortest possible FLITs and commensurate encoding, both of which enable optical and electrical communication. The interface we have designed is available for broader adoption by anyone who is interested in using it, for a nominal licensing fee. It is wide enough to provide class-leading bandwidth while allowing resilience and error-detection features for system availability in the six-nines region. It is also a smart interface in that it can autonomously recognize the topology of the network up to three levels of hierarchy deep, and it is designed to be on its own chiplet in case we find a partner that wants it but cannot design it into their own designs.

When will the Abacus Semi chips be available?
We are working with customers and partners to ensure a prototype tapeout in Q3 of 2025, with volume production set for first customer shipment (FCS) in Q1 of 2026.

Also Read:

CTO Interview: John R. Cary of Tech-X Corporation

Semiwiki CEO Interview: Matt Genovese of Planorama Design

CEO Interview: Dr. Chris Eliasmith and Peter Suma, of Applied Brain Research Inc.


Speeding up Chiplet-Based Design Through Hardware Emulation

by Kalar Rajendiran on 02-16-2023 at 10:00 am

Barriers on the Continuum to SiP

The first chiplets-focused summit took place last month. Many accomplished speakers gave keynote talks on what direction the chiplet ecosystem’s evolution should and would take. Corigine presented a keynote on how hardware emulation should and would evolve to speed up chiplet-based designs. During a pre-conference tutorial session, Corigine shared customer-based case studies to highlight how its MimicPro prototyping and emulation solutions address challenges introduced by chiplet-based designs.

The Chiplet Summit introduced a new tag line: “Chiplets Make Huge Chips Happen.” With large monolithic SoCs losing favor as Moore’s Law slows down, the new tag line highlights how chiplets make large SoCs possible. Of course, tag lines by themselves don’t make things happen. It takes an ecosystem, the companies within the ecosystem, and the people at those companies to make things happen. One of those companies is Corigine, a fabless semiconductor company that designs and delivers leading-edge EDA tools.

Corigine presented insightful thoughts and discussed its innovative solutions during various sessions at the conference. If you missed these sessions, the following is a synthesis of the salient points.

Chiplet-based Design Benefits, Challenges and Solutions

Aside from the economic benefit, from a yield perspective, compared to a large monolithic SoC, chiplets bring many additional benefits to the table: architectural partitioning, enabling of re-use, time-to-market, and product family scalability. Of course, there are many challenges too. The following diagram shows the continuum of barriers when implementing a chiplet-based chip.
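As a rough illustration of the yield benefit just mentioned, a simple Poisson yield model Y = exp(-A·D0) shows why several small dies beat one large die; the defect density and die areas below are assumptions for illustration, not figures from the talk:

```python
import math

def die_yield(area_mm2: float, d0_per_mm2: float = 0.001) -> float:
    """Poisson yield model Y = exp(-A * D0). D0 here (0.1 defects/cm^2,
    i.e. 0.001/mm^2) is an assumed value for illustration."""
    return math.exp(-area_mm2 * d0_per_mm2)

# One 800 mm^2 monolithic SoC vs. four 200 mm^2 chiplets that are tested
# before assembly (known-good-die), so each chiplet yields independently.
mono = die_yield(800.0)
per_chiplet = die_yield(200.0)
print(f"monolithic yield: {mono:.2f}, per-chiplet yield: {per_chiplet:.2f}")
```

Under these assumptions, less than half of the monolithic dies are good, while over 80% of each chiplet lot survives, which is the economic argument for partitioning.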

With Corigine’s focus on addressing the front-end barriers, the following are its learnings during the course of its chiplet-based data processing unit (DPU) chip development work.

Chiplets-based Chip Development and Emulation Requirements

A key consideration for a chiplet is the decision on where to place its various I/O ports. This, of course, is driven by system requirements such as machine learning (ML) processing functionality and datapath SIMD or MIMD organization. With an effective architectural decomposition of the system, the next set of requirements revolves around the interconnect’s attributes. The interconnects should be open, extensible, and backwards compatible. For example, with UCIe being driven as a standard for D2D interconnects, UCIe V2.0 should also support V1.0-based chiplets as the standard evolves.

With the interconnects addressed, the next requirement is a pre-tapeout platform to support integration and verification of heterogeneous chiplets. The platform should be able to support a very large number of transistors and ensure IP protection and segmentation. Finally, none of the above matters if silicon and software co-development cannot be accomplished rapidly and successfully. The co-development platform must provide built-in logic analyzers with complex trigger mechanisms to capture waveforms during software debug.

Corigine’s MimicPro Prototyping and Emulation Solutions

To address the co-development platform, Corigine developed a series of FPGA-based prototyping and emulation platforms, working alongside the silicon and software teams developing its own chiplet-based DPU chip. These platforms are essentially combined prototyping and emulation systems that provide faster software turnaround time. They include functionality for collecting and analyzing data and for introducing design-for-test and design-for-manufacturing features, thereby enabling software verification before tapeout.

The MimicPro solutions deliver an order-of-magnitude performance improvement over traditional emulators of a similar class. Corigine’s patented distributed routing and fine-grained multi-user clocking enable linear performance scaling irrespective of the size of the block being emulated. The dedicated, scalable clock/routing infrastructure enables higher utilization of resources for logic emulation.

Corigine MimicPro was initially optimized for performance and scalability, then enhanced with visibility, portability, and security. It essentially combines rich debugging features, confidential information protection, and the 10-100MHz performance of prototyping. It continues to evolve alongside Corigine’s in-house SmartNIC/Data Processing Unit chiplet design.

The following chart showcases the resource utilization efficiency of a MimicPro system in real-life use by a SmartNIC design.

The following is what Corigine is addressing for chiplets with its MimicPro solutions.

MIMIC Product Information  

MimicPro™ 32

The Corigine MimicPro Prototyping System provides performance and speed for ASIC and software development for both enterprise and cloud operation, with utmost security and scalability. The MimicPro solution scales from 4 to 32 FPGAs and provides easy upgradeability to the latest available FPGAs. The Corigine MimicPro system is the industry’s next-generation platform for automating prototyping, including manual partitioning operations, while providing a system-level view for optimum partitioning and performance. In addition, the MimicPro system adds deep local debug capabilities, providing much greater visibility and faster elimination of bugs. Thus, the MimicPro system reduces overall development time and cost-effectively accelerates software development without dependence on costly emulation.

For more detailed MimicPro™-32 information, you can refer to Corigine’s product page.


MimicTurbo™ GT Card

The Corigine MimicTurbo GT card, based on the UltraScale+™ VU19P FPGA, is designed to simplify the deployment of FPGA-based prototyping at the desktop. Each card can support up to 48 million ASIC gates, has onboard DDR4 component memory, and can be configured to operate with additional connected MimicTurbo GT cards. The card supports 64 GTY transceivers (16 Quads) along with the essential I/O interfaces.

Corigine MimicTurbo GT board is available from the Xilinx website. You can find more detailed product information on AMD/Xilinx FPGA-based Corigine MimicTurbo GT card on this page.


Corigine at DVCon US 2023

Corigine is at DVCon demonstrating its MimicPro-32 this month.
Time: February 27th – March 1st
Location: DoubleTree by Hilton Hotel, San Jose
Registration: https://dvcon.org/registration/

Also Read:

Alphawave Semi at the Chiplet Summit

Who will Win in the Chiplet War?

The Era of Chiplets and Heterogeneous Integration: Challenges and Emerging Solutions to Support 2.5D and 3D Advanced Packaging


ML-Based Coverage Acceleration. Innovation in Verification

by Bernard Murphy on 02-16-2023 at 6:00 am


We looked at another paper on ML-based coverage acceleration back in April 2022. Here is a different angle from IBM. Paul Cunningham (Senior VP/GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome. And don’t forget to come see us at DVCon, first panel (8am) on March 1st 2023 in San Jose!

The Innovation

This month’s pick is Using DNNs and Smart Sampling for Coverage Closure Acceleration. The authors presented the paper at the 2020 MLCAD Workshop and are from IBM Research in Haifa and the University of British Columbia in Canada.

The authors’ intent is to improve coverage of events which have been hit only rarely. They demonstrate their method on a CPU design, refining instruction set (IS) test templates for an IS simulator. Especially interesting in this paper is how they manage optimization on very noisy, low-statistics data, where conventional gradient-based methods are problematic. They suggest several methods to overcome this challenge.

Paul’s view

Here is another paper on using DNNs to improve random instruction generators in CPU verification, which, given the rise of Arm-based servers and RISC-V, is becoming an increasingly hot topic in our industry.

The paper begins by documenting a baseline non-DNN method to improve random instruction coverage. This method works by randomly tweaking instruction generator parameters and banking the tweaks if they improve coverage. The tweaking process is based on a gradient-free numerical method called implicit filtering (see here for a good summary), which works kind of like zoom out-then-in search: start with big parameter tweaks and zoom in to smaller parameter tweaks if the big tweaks don’t improve coverage.
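A minimal sketch of implicit filtering on a toy objective, assuming the zoom out-then-in behavior described above (the paper's actual stencil, parameters, and noisy coverage objective are richer than this):

```python
def implicit_filtering(f, x0, h0=0.5, h_min=0.01, shrink=0.5, max_evals=500):
    """Minimal implicit-filtering sketch: probe a coordinate stencil of
    radius h around x, move to the best improving neighbor, and halve h
    when the whole stencil fails. Illustrative only."""
    x, fx = list(x0), f(x0)
    h, evals = h0, 1
    while h >= h_min and evals < max_evals:
        best_x, best_f = x, fx
        for i in range(len(x)):
            for step in (h, -h):
                cand = x[:]
                cand[i] += step
                fc = f(cand)
                evals += 1
                if fc < best_f:
                    best_x, best_f = cand, fc
        if best_f < fx:
            x, fx = best_x, best_f   # big-step move succeeded ("zoomed out")
        else:
            h *= shrink              # no improvement: "zoom in"
    return x, fx

# Toy objective standing in for "1 - probability of hitting a rare event"
opt, val = implicit_filtering(lambda v: (v[0] - 0.3) ** 2 + (v[1] + 0.7) ** 2,
                              x0=[0.0, 0.0])
print(f"minimum near ({opt[0]:.2f}, {opt[1]:.2f}), value {val:.4f}")
```

Because only function values are compared, never gradients, the method tolerates the noisy, low-statistics coverage measurements the paper deals with.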

The authors then accelerate their baseline method using a DNN to assess if the parameter tweaks will improve coverage before going ahead with costly real simulations to precisely measure the coverage. The DNN is re-trained after each batch of real simulations, so it is continuously improving.

The paper is well written, and the formal justification for their method is clearly explained. Results are presented on two arithmetic pipes of the IBM NorthStar processor (5 instructions and 8 registers). It’s a simple testcase, and sims are run for only 100 clock cycles measuring only 185 cover points. Nevertheless, the results do show that the DNN-based method is able to hit all the cover points with half as many sims as the baseline implicit filtering method. Nice result.

Raúl’s view

As Paul says, we are revisiting a topic we have covered before. In April 2022 we reviewed a paper by Google which incorporated a Control-Data-Flow-Graph into a neural network. Back in December 2021 we reviewed a paper from U. Gainesville using Concolic (Concrete-Symbolic) testing to cover hard-to-reach branches. This month’s paper introduces a new algorithm for coverage-directed test generation combining test templates, random sampling, and implicit filtering (IF) with a deep neural network (DNN) model. The idea is as follows:

As is common in coverage directed generation, the approach uses test templates, vectors of weights on a set of test parameters that guide random test generation. Implicit filtering (IF) is an optimization algorithm based on grid search techniques around an initial guess to maximize chances to hit a particular event. To cover multiple events, the IF process is simply repeated for each event, called the parameter-after-parameter approach (PP). To speed up the IF process, the data collected during the IF process is used to train a DNN, which approximates the simulator and is much faster than simulating every test vector.
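The IF+DNN idea can be sketched as a surrogate-in-the-loop search. In the sketch below a 1-nearest-neighbor model stands in for the DNN and the "simulator" is a toy function, so this only illustrates the shape of the loop, not the paper's implementation:

```python
import random

def surrogate_accelerated_search(simulate, dim, rounds=20, pool=200, top_k=5):
    """Rank random candidate templates with a cheap model trained on past
    simulations, and only simulate the most promising ones."""
    history = []  # (template, measured score) pairs

    def predict(t):
        # 1-nearest-neighbor surrogate standing in for the paper's DNN
        if not history:
            return 0.0
        nearest = min(history,
                      key=lambda h: sum((a - b) ** 2 for a, b in zip(h[0], t)))
        return nearest[1]

    best_t, best_s = None, float("-inf")
    for _ in range(rounds):
        cands = [[random.uniform(0, 1) for _ in range(dim)]
                 for _ in range(pool)]
        cands.sort(key=predict, reverse=True)   # cheap surrogate ranking
        for t in cands[:top_k]:                 # costly "simulations"
            s = simulate(t)
            history.append((t, s))
            if s > best_s:
                best_t, best_s = t, s
    return best_t, best_s

# Toy stand-in for "probability of hitting a rare cover event"
random.seed(0)
best, score = surrogate_accelerated_search(
    lambda v: -sum((x - 0.5) ** 2 for x in v), dim=3)
print(f"best surrogate-guided score: {score:.4f}")
```

The key saving is that the pool of 200 candidates per round costs only cheap model evaluations, while just the top few ever reach the expensive simulator, mirroring the paper's factor-of-two reduction in simulations.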

The effectiveness of the algorithms is evaluated using an abstract high-level simulator of part of the NorthStar processor. Four algorithms are compared: random sampling, PP, DNN, and combined IF and DNN. The results of three experiments are reported:

  1. Running the algorithms with a fixed number of test templates, up to 400 runs. Combining IF and DNN is superior, missing at most 1/3 of the hard-to-hit events
  2. Running the algorithms until all hard-to-hit events are covered. The IF-and-DNN combination converges with half the number of test templates
  3. Running the last algorithm (IF and DNN) 5 times. All runs converge with a similar number of test templates, even the worst using ~30% fewer test templates than the other algorithms

This is a well-written paper on a relevant problem in the field. It is (almost) self-contained, it is easy to follow, and the algorithms employed are reproducible. The results show a reduction of “the number of simulations by a factor of 2 or so” over implicit filtering. These results are based on one relatively simple experiment, NorthStar. I would have liked to see additional experimentation and results; some can be found in other publications by the authors.


The State of FPGA Functional Verification

by Daniel Payne on 02-15-2023 at 10:00 am


Earlier I blogged about IC and ASIC functional verification, so today it’s time to round that out with the state of FPGA functional verification. The Wilson Research Group has been compiling an FPGA report every two years since 2018, so this marks the third time they’ve focused on this design segment. At $5.8 billion the FPGA market is sizable, and forecasted to grow to $8.1 billion by 2025. FPGAs started out in 1984 with limited gate capacity, and have now grown to include millions of gates, processors and standardized data protocols.

Low-volume applications benefit from the low NRE of FPGA devices, and engineers can quickly prototype their designs by verifying and validating at speed. FPGAs now include processors, like the Xilinx Zynq UltraScale+, Intel Stratix, and Microchip SmartFusion. Among the 980 participants in the functional verification study, the FPGA and programmable SoC FPGA design styles are the most popular.

Design Styles

As the size of FPGAs has increased recently, the chance of a bug-free production release has dropped to just 17%, even worse than the 30% of IC and ASIC projects that achieve correct first silicon. Clearly, we need better functional verification for complex FPGA systems.

FPGA bug escapes into production

The types of bugs found in production fall into several categories:

  • 53% – Logic or Functional
  • 31% – Firmware
  • 29% – Clocking
  • 28% – Timing, path too slow
  • 21% – Timing, path too fast
  • 18% – Mixed-signal interface
  • 9% – Safety feature
  • 8% – Security feature

Zooming into the largest category of failure, logic or functional, there are five root causes.

Root Causes

FPGA projects mostly didn’t complete on time, once again driven by the larger size of the systems, the complexity of the logic, and even the verification methods being used.

FPGA Design Schedules

Engineers on an FPGA team can have distinct titles like design engineer or verification engineer, yet on 22% of projects there were no verification engineers – meaning that the design engineers did double-duty and verified their own IP. Over the past 10 years there’s been a 38% increase in the number of verification engineers on an FPGA project, so that’s progress towards bug-free production.

Number of engineers

Verification engineers on FPGA projects spent most of their time on debug tasks at 47%:

  • 47% – Debug
  • 19% – Creating test and running simulation
  • 17% – Testbench development
  • 11% – Test Planning
  • 6% – Other

The number of embedded processors has steadily grown over time; 65% of FPGA designs now have one or more processor cores, increasing the amount of hardware/software interface verification and on-chip network management.

Embedded Processors

The ever-popular RISC-V processor is embedded in 22% of FPGAs, and AI accelerators are used in 23% of projects. FPGA designs average 3-4 clock domains, which require gate-level timing simulations plus static Clock Domain Crossing (CDC) tools for verification.

Security features are added to 49% of FPGA designs to hold sensitive data, plus 42% of FPGA projects adhere to safety-critical standards or guidelines. On SemiWiki we’ve often blogged about ISO 26262 and DO-254 standards. Functional Safety (FuSa) design efforts take between 25% to 50% of the overall project time.

Safety Critical Standards

The top three verification languages are VHDL, SystemVerilog and Verilog; but also notice the recent jumps in Python and C/C++ languages.

Verification Languages

The most popular FPGA methodologies and testbench base-class libraries are Accellera UVM, OSVVM and UVVM. The Python-based cocotb was even added as a new category for 2022.

Verification Methodologies

Assertion languages are led by SystemVerilog Assertions (SVA) at 45%, followed by Accellera Open Verification Library (OVL) at 13% and PSL at 11%. FPGA designs may combine VHDL for RTL design along with SVA for assertions.

Formal property checking is growing amongst FPGA projects, especially as more automatic formal apps have been introduced by EDA vendors.

Formal Techniques

Simulation-based verification approaches show steady adoption over the past 10 years, listed in order of relevance: code coverage, functional coverage, assertions, and constrained random.

Summary

The low 17% bug-free number for FPGA projects that made it into production in 2022 was the most surprising to me, as recalling or re-programming a device in the field is expensive and time consuming. A more robust functional verification approach should lead to fewer bug escapes into production, and dividing the study participants into two groups does show the benefit.

Verification Adoption

Read the complete 18 page white paper here.

Related Blogs


Area-optimized AI inference for cost-sensitive applications

by Don Dingee on 02-15-2023 at 6:00 am

Expedera uses packet-centric scalability to move up and down in AI inference performance while maintaining efficiency

Often, AI inference brings to mind more complex applications hungry for more processing power. At the other end of the spectrum, applications like home appliances and doorbell cameras can offer limited AI-enabled features but must be narrowly scoped to keep costs to a minimum. New area-optimized AI inference technology from Expedera is taking on this challenge, targeting 1 TOPS performance in the smallest possible chip area.

Optimized for one model, but maybe not for others

Fitting into an embedded device brings constraints and trade-offs. For example, many teams concentrate on developing the inference model for an application using a GPU-based implementation, only to discover that no amount of optimization will get them anywhere near the required power-performance-area (PPA) envelope.

A newer approach uses a neural processing unit (NPU) to handle AI inference workloads more efficiently, delivering the required throughput in less die size and power consumption. NPU hardware typically scales up or down to meet throughput requirements, often measured in tera operations per second (TOPS). In addition, compiler software can translate models developed in popular AI modeling frameworks like PyTorch, TensorFlow, and ONNX into run-time code for the NPU.
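
As a back-of-the-envelope check on how TOPS budgets relate to model workloads (the model size and frame rate below are illustrative figures, not vendor data):

```python
# Back-of-the-envelope NPU sizing: one MAC counts as 2 ops
# (multiply + add). Model and frame-rate figures are illustrative.
macs_per_inference = 300e6    # roughly a MobileNet-class model
frames_per_second  = 30
ops_per_second = macs_per_inference * 2 * frames_per_second
print(f"{ops_per_second / 1e12:.3f} TOPS needed")   # 0.018 TOPS needed
```

Even at 30 frames per second, a lower-end model of this class needs only a small fraction of a 1 TOPS budget, which is why area-optimized NPUs can target cost-sensitive devices.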

Following a long-held principle of embedded design, there’s a strong temptation for designers to optimize their NPU hardware for their application, wringing out every last cent of cost and milliwatt of power. If only a few AI inference models are in play, it might be possible to optimize hardware tightly using a deep understanding of model internals.

Model parameters manifest as operations, weights, and activations, varying considerably from model to model. Below is a graphic comparing several popular lower-end neural network models.

On top of these differences sits the neural network topology – how execution units interconnect in layers – adding to the variation. Supporting different models for additional features or modes leads to overdesigning with a one-size-fits-all NPU big enough to cover performance in all cases. However, living with the resulting cost and power inefficiencies may be untenable.

NPU co-design solves optimization challenges

It may seem futile to optimize AI inference in cost-sensitive devices where models are unknown when the project starts, or where more than one model must run for different modes. But is it possible to tailor an NPU more closely to a use case without an enormous investment in design time, or without running the risk of an AI inference model changing later?

Here’s where Expedera’s NPU co-design philosophy shines. The key is not hardcoding models in hardware but instead using software to map models to hardware resources efficiently. Expedera does this with a unique work sequencing engine, breaking operations down into metadata sent to execution units as a packet stream. As a result, layer organization becomes virtual, operations order efficiently, and hardware utilization increases to 80% or more.

In some contexts, packet-centric scalability unlocks higher performance, but in Expedera’s area-optimized NPU technology, packets can also help scale performance down for the smallest chip area.

Smallest possible NPU for simple models

Customers say a smaller NPU that matches requirements and keeps costs to a minimum can make the difference between having AI inference or not in cost-sensitive applications. On the other hand, a general-purpose NPU might have to be overdesigned by as much as 3x, driving up die size, power requirements, and additional costs until a design is no longer economically feasible.

Starting with its Origin NPU architecture, fielded in over 8 million devices, Expedera tuned its engine for a set of low to mid-complexity neural networks, including MobileNet, EfficientNet, NanoDet, Tiny YOLOv3, and others. The results are the new Origin E1 edge AI processors, putting area-optimized 1 TOPS AI inference performance in soft NPU IP ready for any process technology.

“The focus of the Origin E1 is to deliver the ideal combination of small size and lower power consumption for 1 TOPS needs, all within an easy-to-deploy IP,” says Paul Karazuba, VP of Marketing for Expedera. “As Expedera has already done the optimization engineering required, we deliver time-to-market and risk-reduction benefits for our customers.”

Seeing a company invest in more than just simple throughput criteria to satisfy challenging embedded device requirements is refreshing. For more details on the area-optimized AI inference approach, please visit Expedera’s website.

Blog post: Sometimes Less is More—Introducing the New Origin E1 Edge AI Processor

NPU IP product page: Expedera Origin E1


Interconnect Choices for 2.5D and 3D IC Designs

by Daniel Payne on 02-14-2023 at 10:00 am


A quick Google search for “2.5D 3D IC” returns 669,000 results, so it’s a popular topic for the semiconductor industry, and there are plenty of decisions to make, like whether to use an organic substrate or silicon interposer for interconnect of heterogenous semiconductor die. Design teams using 2.5D and 3D techniques soon realize that there are many data formats to consider:

  • GDS – chiplet layout
  • LEF/DEF – Library Exchange Format, Design Exchange Format
  • Excel – ball map
  • Verilog – logic design
  • ODB++ – BGA package
  • CSV – Comma Separated Value

A recent e-book from Siemens provides some much-needed guidance on the challenges of managing the connectivity across the multiple data formats. Source data gets imported into their connectivity management tool, and then each implementation tool receives the right data for analyzing thermal, SI (Signal Integrity), PI (Power Integrity), IR drop, system-level LVS, and assembly checking.

For consistency, your design team should use a single source of truth, so that when a design change is made the full system is updated and each implementation tool has the newest input data. The Siemens workflow stays in sync through the system-level LVS approach.

There’s no standard file format between package, interposer and board teams, yet by using ODB++ you can take in package and PCB data to the planning tool, allowing your team to communicate and optimize using any EDA tool. A package designer can move bumps around, and then the silicon team can review the changes using DEF files to accept them.

The largest system-in-package designs can have one million total pins, so your tools need to handle that capacity. Yield on a substrate depends on the accurate placement of vias, via arrays and metal areas. Your substrate or interposer layout tool has to manage the interfaces properly, and make sure to get the foundry or OSAT assembly design kit for optimal results.

From the Siemens tool you have a planning cockpit to graphically and quickly create a virtual prototype of the complete 2.5D/3D package assembly, aka a digital twin. This methodology makes System Technology Co-Optimization (STCO) possible. Making early trade-offs between architecture and technology produces the best results for a new system, using predictive analysis to sort through all the different design scenarios. Predictive analysis validates that the net names are consistent between the die, interposer and package, thus avoiding shorts and opens.
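
At its core, the net-name consistency check is a cross-domain set comparison; here is a minimal sketch with hypothetical net lists (not Siemens' actual tool or data model):

```python
def netname_mismatches(*domains):
    """Report nets that do not appear in every design domain.
    Each argument is the set of net names from one domain
    (die, interposer, package)."""
    union = set().union(*domains)
    common = set.intersection(*domains)
    return sorted(union - common)

die        = {"VDD", "VSS", "PCIE_TX0", "PCIE_RX0"}
interposer = {"VDD", "VSS", "PCIE_TX0", "PCIE_RX0"}
package    = {"VDD", "VSS", "PCIE_TX0", "PCIE_RX_0"}   # naming drift

print(netname_mismatches(die, interposer, package))
# ['PCIE_RX0', 'PCIE_RX_0'] -> a likely short/open waiting to happen
```

A single character of naming drift between the package and silicon teams is exactly the kind of mismatch that, left uncaught, becomes an open or short at assembly.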

System Technology Co-Optimization

System LVS ensures that all design domains are DRC and LVS clean, validating connections at the package bumps, interposer and die.

Physical verification is required during many steps:

  • Die level DRC and LVS
  • Interposer
  • Package
  • All levels together

The Siemens planning tool does all of this, while keeping the system design correct from start to finish, eliminating late surprises. An equivalence check also needs to be run between the planning tool and the final design.

Using a digital twin methodology your team can now verify that the package system is correct. Early mistakes are quickly caught through verification, like “pins up, pins down”, through an overlaps check between the package, silicon and interposer. Bump locations will also be checked for consistency between package and IC teams. Checks can be run after every change or update, just to ensure that there are no surprises.

Summary

The inter-related teams of IC, package and board can now work together by using a digital twin approach, as offered by Siemens. Not many EDA vendors have years of experience in tool flows for all three of these areas, plus you can add many of your favorite point EDA tools. Collaboration and optimization are possible for the challenges of 2.5D/3D interconnects.

Read the full 14 page e-book from Siemens.

Related Blogs

 


PCIe 6.0: Challenges of Achieving 64GT/s with PAM4 in Lossy, HVM Channels

by Kalar Rajendiran on 02-14-2023 at 6:00 am


As the premier high-speed communications and system design conference, DesignCon 2023 offered deep insights from various experts on a number of technical topics. In the area of high-speed communications, PCIe has played a crucial role over the years in supporting increasingly higher communication speeds with every new revision. Revision 6.0, the latest revision of this communications interface standard, enables system designers to achieve advances in the deployment of AI inference engines and co-processors in data centers. Consequently, PCIe 6.0 was a hot topic at the conference, not just for the 64GT/s speed but also for understanding the engineering challenges to reliably deliver that speed.

PCIe 6.0 poses a demanding set of chip and system design challenges for engineers. To reliably deliver the full benefits of PCIe 6.0, collaboration and cooperation are needed to standardize specifications in the areas of PCIe cards, cables, connector assemblies, test methods, measurement and tools, and PCIe PHY and controller IP. An expert panel discussing these very topics included David Bouse from Tektronix, Rick Eads from Keysight Technologies, Steve Krooswyk from Samtec, Madhumita Sanyal from Synopsys and Timothy Wig from Intel. The panel session was moderated by Pegah Alavi from Keysight Technologies.

Pegah opened the session by highlighting the challenges introduced by multi-level signaling (MLS) when the switch was made from NRZ to PAM4 signaling to support 64GT/s. The adoption of MLS has opened up the path to continued increases in data communication speeds. By mapping more than 1 bit into each transmitted symbol, the required bandwidth per bit is reduced. But MLS introduces a lot of challenges too, which need to be overcome to achieve the speed benefit in a reliable manner.

Under MLS, the signal-to-noise ratio worsens, negatively impacting the performance of the channel. Consequently, all aspects of the channel need close attention. With that introduction, Pegah set the stage for the panelists to update the audience on their respective areas of focus to deliver a reliable PCIe 6.0 end-user solution. The following is a synthesis of the salient points from the session.

PCIe Card and Cable Form Factor Updates

Rev 6.0 of the PCIe Card Electromechanical (CEM) form factor specification is being finalized in 2023. The Rev 6.0 mechanical updates are completely redefining chassis retention on the North and East vias.

The CEM card physical form factor introduces two new power connectors at 48V to deliver 600W.

A shielded plane/south via approach has been introduced to shield the send signals from the receive signals. Without the shielding plane/south via approach, PCIe 6.0 channels would be completely broken, given known examples of inattentive card layout sabotaging even PCIe 5.0 channels.

Two PCIe cable form factors are being defined. Both these new form factors are distinct from previous PCIe cable solutions. An internal cable form factor is being defined based on the EDSFF-TA-1016 cable system targeting PCIe 5.0 and PCIe 6.0 speeds. An external cable form factor is being defined based on the industry standard CDFP. The Internal PCIe cable form factor has been characterized for a range of connectors and cables from multiple vendors, mounting styles and lengths.

Test Methods and Tools

The PCIe ecosystem is keeping PCIe 7.0 in mind as it defines and develops tools and test methods for PCIe 6.0. After all, the PCIe 7.0 spec (128 GT/s) is just around the corner, expected to arrive in the 2024-2025 time frame. The Tx, Rx and channel compliance requirements are kept in mind as the simulation, test and measurement methods are developed to validate connectors and cable-connector assemblies. Forward Error Correction (FEC) has been introduced in PCIe 6.0, a first for the PCIe interface standard, to accommodate the impact of channel loss.

PCIe v6.0 Retimer

All of the work presented above ensures that the cards, cables, connectors and assemblies are validated to support PCIe 6.0. Depending on the end market and application, a PCIe-based system will deploy different channel topologies leveraging the hardware listed above. Consequently, each channel topology brings its own characteristics that impact channel performance.

The following chart shows four different channel topologies that are commonly found in PCIe-based systems.


From the PCIe PHY perspective, it needs to be able to optimize for all possible channel topologies. Given the reduced insertion-loss budget imposed by the PCIe 6.0 specification, the question is how to ensure that the signal from the Root port reaches the destination port without losing fidelity.

The solution is the introduction of a PCIe 6.0 Retimer circuit. PCIe Retimers enable the expansion of PCIe over system boards, backplanes, cables, risers and add-in cards, irrespective of the channel topology that is deployed. A Retimer is a physical-layer and protocol-aware device, yet software-transparent, and can reside anywhere in the channel between the PCIe Root-port and End-point. It fully recovers the data over any channel from the Host PCIe Root-port, extracts the clock, and re-transmits the clean data over another channel to the PCIe End-point device. The Retimer solution is implemented in the form of a customized PHY and light controller logic for the MAC.

Summary

The panelists offered a number of tips, tricks, and best practices throughout the session. When DesignCon makes the panelists’ presentation materials available on its website, they will be worth downloading as reference material. You may want to reach out to the panelists for more specific, detailed information.

Also Read:

Optimization Tradeoffs in Power and Latency for PCIe/CXL in Datacenters

Synopsys Design Space Optimization Hits a Milestone

Webinar: Achieving Consistent RTL Power Accuracy


Optimization Tradeoffs in Power and Latency for PCIe/CXL in Datacenters

by Daniel Nenni on 02-13-2023 at 10:00 am


PCI Express Power Bottleneck

Madhumita Sanyal, Sr. Technical Product Manager, and Gary Ruggles, Sr. Product Manager, discussed the tradeoffs between power and latency in PCIe/CXL data centers during a live SemiWiki webinar on January 26, 2023. The demands on PCIe continue to grow with the integration of multiple components and the challenge of balancing power and latency. The increasing number of lanes, multicore processors, SSD storage, GPUs, accelerators, and network switches have contributed to this growth in demand for PCIe in compute, servers, and datacenter interconnects. Gary and Madhumita provided expert insights on PCIe power states and power/latency optimization. I will cherry pick a few things that interested me.

Watch the full webinar for a more comprehensive understanding on Power, Latency for PCIe/CXL in Datacenters from Synopsys experts.

Figure 1. Compute, Server, and Data Center Interconnect Devices with Multiple Lanes Hit the Power Ceiling

Reducing Power with L1 & L2 PCIe Power States

In the early days of PCIe, the standard was primarily focused on PCs and servers, with an emphasis on achieving high throughput. This early standard lacked considerations for what we would now call green or mobile friendly. However, since the introduction of PCIe 3.0, PCI-SIG has placed a strong emphasis on supporting aggressive power savings while continuing to advance performance goals. These power savings are achieved through states defined by the standard, known as link states. Link states range from L0 (everything on) to L3 (everything off), with intermediate states contributing various levels of power savings. The possible link states continue to be refined as the standard advances.

Madhumita explained that PCIe PHYs are the big power hogs, accounting for as much as 80% of power consumption in the fully-on (L0) state! The lower-power L1 state now includes various sub-states, enabling the deactivation of transceivers, PLLs, and analog circuitry in the PHY. The L2 power state reflects a power-off state with only auxiliary power to support circuitry such as retention logic. L1 (and its sub-states) and L2 are the workhorses for fine-tuning power savings. PCIe 6.0 introduces the optional L0p state, which allows a subset of lanes in a link to be powered down dynamically while keeping the remainder fully active, reducing bandwidth and power consumption simultaneously.

With PCIe power states defined, the Synopsys experts delved deeper into the process by which the host and device determine the appropriate link state. A link in any form of sleep state will incur a latency penalty upon waking – known as exit latency – such as when transitioning back to L0 to support communication with an SSD. To reduce the system impact of this penalty, the standard specifies a latency tolerance reporting (LTR) mechanism which informs the host of the latency the device can tolerate on an interrupt request, ultimately guiding the negotiation process.
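
The LTR-guided negotiation can be illustrated as picking the deepest link state whose exit latency fits the device's reported tolerance. The power and exit-latency figures below are illustrative placeholders, not PCI-SIG specification values:

```python
# Pick the deepest PCIe link state whose exit latency fits the device's
# reported latency tolerance (LTR). Power and exit-latency figures are
# illustrative placeholders, not PCI-SIG specification values.
LINK_STATES = [          # (name, relative power, exit latency in µs)
    ("L0",   1.00,    0.0),
    ("L0s",  0.70,    1.0),
    ("L1",   0.30,   20.0),
    ("L1.2", 0.05,  100.0),
    ("L2",   0.01, 1000.0),
]

def deepest_state(ltr_us):
    eligible = [s for s in LINK_STATES if s[2] <= ltr_us]
    return min(eligible, key=lambda s: s[1])[0]   # lowest power that fits

print(deepest_state(50))     # L1: L1.2's 100 µs exit latency is too slow
print(deepest_state(5000))   # L2: everything fits, take the deepest
```

The tradeoff is visible directly: a device that tolerates longer wake-up delays lets the host park the link in a much lower-power state.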

Using Clock-Gating to Reduce Activity

The range of power-saving options in digital logic is well known. I was particularly interested in the use of clock-gating techniques to optimize energy consumption by eliminating wasted clock toggling on individual flops or banks of flops, or even globally for entire blocks. Dynamic voltage and frequency scaling (DVFS) decreases power by reducing operating voltage and clock frequency on functions which can afford to run slower at times. Although DVFS can result in significant power savings, it also adds complexity to the logic. Finally, power gating allows for shutting off both dynamic and leakage power at a block level, except perhaps for auxiliary power to support retention logic.
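
The DVFS payoff follows from the dynamic power relation P = a·C·V²·f, which is quadratic in voltage. A quick worked example with illustrative numbers:

```python
# Dynamic CMOS power follows P = a * C * V^2 * f, so lowering voltage
# pays off quadratically. All figures below are illustrative.
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    return activity * c_eff * vdd ** 2 * freq

p_full = dynamic_power(1e-9, 0.9, 1.0e9)   # ~0.81 W at full voltage/speed
p_dvfs = dynamic_power(1e-9, 0.7, 0.6e9)   # ~0.29 W after scaling down
print(f"DVFS saves {(1 - p_dvfs / p_full) * 100:.0f}%")   # DVFS saves 64%
```

A 22% voltage drop plus a 40% frequency drop cuts dynamic power by roughly two thirds, which is why DVFS is worth its added logic complexity when the workload allows it.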

In addition to these options, there are other techniques such as the use of mixed-VT libraries. Madhumita also expanded on board and backplane considerations in balancing performance vs. power in PCIe 6.0. Low power can be achieved with lower channel reaches. For a more comprehensive discussion on these topics, I encourage you to watch the webinar.

Latency in PCIe/CXL: Waiting is the Hardest Part!

Gary Ruggles recommends utilizing optimized embedded endpoints to reduce latency. These endpoints avoid the need for the full PCIe protocol from the host, through a physical connection and again through the full PCIe protocol on the device side. For example, a NIC interface could be embedded directly in the same SoC as the host, connecting to the PCIe switch directly through a low latency interface.

Gary also expanded on using a faster clock to decrease latency, while acknowledging the obvious challenges. A faster clock may require higher voltage levels, leading to increased dynamic power consumption, and higher-speed libraries increase leakage power. However, the tradeoff between clock speed and pipelining is not always clear-cut. Despite the potential increase in power consumption, a faster clock may still yield a performance advantage if the added pipelining latency is outweighed by the reduction in functional latency. Latency considerations also factor into how you plan power states in PCIe. Fine-grained power-state management can reduce power usage, but it also increases exit latencies, which become more consequential when managing power aggressively.
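
That clock-versus-pipelining tradeoff is simple arithmetic: a faster clock wins only if the extra pipeline stages it forces don't eat the gain. The cycle counts and frequencies below are illustrative:

```python
# A faster clock reduces functional latency only if the extra pipeline
# stages it forces don't eat the gain. Cycle counts are illustrative.
def latency_ns(cycles, freq_ghz):
    return cycles / freq_ghz

slow = latency_ns(10, 1.0)   # 10 cycles at 1.0 GHz -> 10.0 ns
fast = latency_ns(13, 1.5)   # 3 extra pipe stages at 1.5 GHz -> ~8.7 ns
print(fast < slow)           # True: the faster clock still wins here
```

Had the faster design needed, say, 16 cycles, its latency would exceed 10 ns and the slower clock would have been the better choice despite its lower frequency.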

Gary’s final point on managing latency is to consider the use of CXL. This protocol is built on PCIe, while also supporting the standard protocol through CXL.io. CXL’s claim to fame is support for cache-coherent communication through CXL.cache and CXL.mem. These interfaces offer much lower latency than PCIe. If you need coherent cache/memory access, CXL could be a good option.

Takeaways

Power consumption is a major concern in datacenters. The PCIe standard makes allowance for multiple power states to take advantage of opportunities to reduce power in the PHY and in the digital logic. Taking full advantage of the possibilities requires careful tradeoffs between optimization for latency, power, and throughput, all the way from software down to the PCIe physical layer. When suitable, CXL proves to be a promising solution, offering much lower latency compared to conventional PCIe.

Naturally, Synopsys has production IP for PCIe (all the way up to Gen 6) and for CXL (all the way to CXL 3.0).

You can watch the webinar HERE.

Also Read:

PCIe 6.0: Challenges of Achieving 64GT/s with PAM4 in Lossy, HVM Channels

How to Efficiently and Effectively Secure SoC Interfaces for Data Protection

ARC Processor Summit 2022 Your embedded edge starts here!