DSP and AI are generally considered separate disciplines with different application solutions. In their early stages (before programmable processors), DSP implementations were discrete, built around a digital multiplier-accumulator (MAC). AI inference implementations also build on a MAC as their primitive. If the interconnect were programmable, could the MAC-based hardware be the same for both and still be efficient? Flex Logix says yes with their next-generation InferX reconfigurable DSP and AI IP.
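To make the shared-primitive point concrete, here is a minimal Python sketch (illustrative only, not Flex Logix code) showing that a FIR filter tap loop and a neural-network neuron are the same multiply-accumulate loop with different operands:

```python
# Illustrative only: a DSP FIR output and an AI neuron output
# both reduce to the same multiply-accumulate (MAC) primitive.

def fir_sample(coeffs, history):
    """One DSP output sample: sum of coefficient * delayed-input products."""
    acc = 0
    for c, x in zip(coeffs, history):
        acc += c * x          # MAC: multiply, then accumulate
    return acc

def neuron(weights, activations, bias):
    """One AI inference dot product: the same MAC loop, different operands."""
    acc = bias
    for w, a in zip(weights, activations):
        acc += w * a          # identical primitive operation
    return acc
```

If the MACs are fixed in hardware and only the interconnect and sequencing change, the same silicon can serve both loops – which is exactly the bet InferX makes.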
Blocking-up tensors with a less complex interconnect
If your first thought reading that intro was, “FPGAs already do that,” you’re not alone. When tearing into something like an AMD Versal, one sees AI engines, DSP engines, and a programmable network on chip. But there’s also a lot of other stuff, making it a big, expensive, power-hungry chip that can only go in a limited number of places able to support its needs.
And, particularly in DSP applications, the full reconfigurability of an FPGA isn’t needed. Having large numbers of routable MACs sounds like a good idea, but configuring them together dumps massive overhead into the interconnect structure. A traditional FPGA looks like 80% interconnect and 20% logic, a point most simplified block diagrams gloss over.
Flex Logix CEO Geoff Tate credits his co-founder and CTO Cheng Wang with taking a fresh look at the problem. On one side are these powerful but massive FPGAs. On the other side sit DSP IP blocks from competitors that don’t pivot from their optimized MAC pipeline to sporadic AI workloads with vastly wider and often deeper MAC fields organized in layers.
Wang’s idea: create a next-generation InferX 2.5 tile built around tensor processor units (TPUs), each with eight blocks of 64 MACs (INT8 x INT8) tied to memory and a more efficient eFPGA-based interconnect. With 512 MACs per TPU and 8192 MACs per tile, each tile delivers 16 TOPS peak at 1 GHz. That flips the percentages: 80% of the InferX 2.5 unit is hardwired, yet it retains 100% reconfigurability. One tile in TSMC 5nm is a bit more than 5mm², a 3x to 5x improvement over competitive DSP cores at equivalent DSP throughput.
Software makes reconfigurable DSP and AI IP work
The above tile is the same for either DSP or AI applications – configuration happens in software.
The required DSP operations for a project are usually close to being locked down before committing to hardware. InferX 2.5, with its software, can handle any function: FFT, FIR, IIR, Kalman filtering, matrix math, and more, at INT16x16 or INT16x8 precision. One tile delivers 4 TOPS (INT16 x INT16), or in DSP lingo 2 TeraMACs/sec, at 1 GHz. Flex Logix provides a library of soft logic and function APIs, simplifying application development. Another footprint-saving feature: an InferX 2.5 tile can be reconfigured in less than 3 µsec, enabling a quick function change for the next pipeline step.
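The arithmetic behind these throughput figures is easy to check. A quick sketch, assuming 1 MAC = 2 ops (one multiply plus one add) at a 1 GHz clock:

```python
# Back-of-envelope check of the quoted throughput numbers.
# Assumption: 1 MAC = 2 ops (multiply + add), 1 GHz clock.

clock_hz = 1e9
macs_per_tile_int8 = 8192                 # 16 TPUs x 512 MACs (INT8 x INT8)

tops_int8 = macs_per_tile_int8 * 2 * clock_hz / 1e12
print(f"INT8 peak: {tops_int8:.1f} TOPS")        # ~16.4 TOPS per tile

teramacs_int16 = 2e12                     # quoted DSP rate, INT16 x INT16
tops_int16 = teramacs_int16 * 2 / 1e12
print(f"INT16 peak: {tops_int16:.0f} TOPS")      # 4 TOPS per tile
```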
AI configurations use the same tile with different Flex Logix software. INT8 precision is usually enough for edge AI inference, meaning a single tile and its 16 tensor units push 16 TOPS at 1 GHz. The 3 µsec reconfigurability allows layers or even entire models to switch processing almost instantly. Flex Logix AI quantization, compilers, and soft logic handle the mapping for models in PyTorch, TensorFlow Lite, or ONNX, so application developers don’t need to know hardware details to get up and running. And, with the reconfigurability, teams don’t need to commit to an inference model until ready and can change models as often as required during a project.
Scalability comes with multiple tiles. N tiles provide N times the performance in DSP or AI applications, and tiles can run functions independently for more flexibility. Tate says so far, customers have not required more than eight tiles for their needs, and points out larger arrays are possible. Tiles can also be power managed – below, an InferX 2.5 configuration has four powered tiles and four managed tiles that can be powered down to save energy.
Ready to deliver more performance within SoC power and area limits
Stacking InferX 2.5 up against today’s NVIDIA baseline provides additional insight. Two InferX 2.5 tiles in an SoC check in at around 10mm² and less than 5W – and deliver the same Yolo v5 performance as a much larger external 60W Orin AGX. Putting this in perspective, below is super-resolution Yolo v5L6 running on an SoC with InferX 2.5.
Tate says what he hears in customer discussions is that transformer models are coming – maybe displacing convolutional and recurrent neural networks (CNNs and RNNs). At the same time, AI inference is moving into SoCs with other integrated capabilities. Uncertainty around models is high, while area and power requirements for edge AI have finite boundaries. InferX 2.5 can run any model, including transformer models, efficiently.
Whether the need is DSP or AI, InferX is ready for the performance, power, and area challenge. For more on the InferX 2.5 reconfigurable DSP and AI IP story, please see the following:
SoC test challenges arise from the complexity and diversity of the functional blocks integrated into the chip. As SoCs become more complex, it becomes increasingly difficult to access all of the functional blocks within the chip for testing. SoCs can also contain billions of transistors, making chips extremely time-consuming to test. As test time directly impacts test cost, minimizing test time is critical to managing the cost of a finished product. Automatic Test Pattern Generation (ATPG) is a crucial part of SoC testing, as it generates the test patterns that detect faults in the design. However, automating ATPG is a challenging task, especially for complex SoCs, due to the large number of functional blocks and test points that need to be covered. Developing efficient and effective ATPG algorithms is a key challenge for SoC testing, and many ATPG tools today are not fully automated: users have to learn all the commands and options offered by the tools in order to use them effectively.
Is there a solution that brings some automation to the ATPG process, thereby enhancing engineering productivity? What if this solution also delivers significant savings in test time? Siemens EDA’s Tessent Streaming Scan Network (SSN) solution promises to deliver these benefits. This was substantiated by Intel, one of Siemens EDA’s customers, during the recent User2User conference. Intel’s Toai Vo presented proof points based on his team’s experience with their first design using the Tessent SSN solution. His team included Kevin Li, Joe Chou and Chienkuo (Tom) Woo.
Tessent SSN Solution
In a standard scan testing approach, test data is loaded into the circuit one bit at a time and shifted through the scan chains to observe the output responses. This process is repeated for each test pattern, which can lead to long test times. The Tessent SSN solution instead packetizes test data, dramatically reducing DFT implementation effort and manufacturing test times. By decoupling core-level and chip-level DFT requirements, each core can be designed with the optimal compression configuration for that core. The solution can efficiently test large and complex chips with a high number of internal nodes to be tested, using a dedicated network that transmits test data in a streaming manner, enabling parallel processing of the data and thereby reducing test time.
Scalability
The Streaming Scan Network supports scalable scan architectures that can handle SoCs with a large number of functional blocks. The tool provides a scalable approach to testing any number of cores concurrently while minimizing test time and scan data volume. The Tessent SSN test infrastructure is built around the IEEE 1687/IJTAG standard, delivering greater flexibility and scalability to handle more complex designs and test scenarios.
Automation
The hierarchical, object-oriented nature of the test infrastructure lends itself to easier automation. Using the Tessent infrastructure, a user can easily insert test logic into a chip. The process begins with the RTL design, where the SSN test logic is inserted using automation.
Test Time Savings
Using a traditional ATPG approach, normally only one block can be run at a time, which extends total test time. With the Tessent SSN ATPG approach, multiple blocks can be run in parallel, thereby greatly reducing the total test time – the idealized sketch below illustrates the effect. The table following it shows the test time savings achieved by Toai’s team on their design.
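To see why parallel block testing matters so much, here is an idealized sketch with invented per-block test times – serial testing pays the sum of all block times, while concurrent SSN-style testing approaches the time of the longest block:

```python
# Idealized illustration of serial vs. concurrent block testing.
# The per-block times below are invented for illustration.
block_test_times_ms = [120, 95, 80, 60, 45]

serial_total = sum(block_test_times_ms)     # one block at a time
parallel_total = max(block_test_times_ms)   # all blocks streamed concurrently

print(f"serial:   {serial_total} ms")       # 400 ms
print(f"parallel: {parallel_total} ms")     # 120 ms
print(f"speed-up: {serial_total / parallel_total:.1f}x")
```

Real savings depend on pattern volumes and network bandwidth, but the direction of the effect is the same.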
Summary
Toai’s team found it very easy to migrate from traditional embedded deterministic test (EDT) channel-based ATPG to packet-based ATPG with SSN. The Tessent SSN solution greatly reduced engineering effort and silicon bring-up time, and the test time reduction was significant compared to a traditional test solution. In Toai Vo’s words, it is absolutely an innovative test solution, and it really works.
Dan is joined by Chris Morrison, who has 15 years’ experience in delivering innovative analog, digital, power management and audio solutions for international electronics companies, and in developing strong relationships with key partners across the semiconductor industry. Currently he is the Director of Product Marketing at Agile Analog, the analog IP innovators. Previously he held engineering positions, including 10 years at Dialog Semiconductor, since acquired by Renesas.
Chris details some of the new developments at Agile Analog, including foundry ecosystem expansion and new product introductions that are coming. Chris also explains the details behind how Agile Analog puts a digital “wrapper” around analog IP subsystems. The benefits of this approach for AMS integration are detailed, along with information about the targeted, customized delivery methodology used by Agile Analog.
The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.
At the recent Synopsys Users Group Meeting (SNUG) I had the honor of leading a panel of experts on the topic of chiplets. One of those panelists was the very personable Dr. Henry Sheng, Group Director of R&D in the EDA Group at Synopsys. Henry currently leads engineering for 3DIC, advanced technology and visualization.
Are we seeing other markets move in this direction?
We’re seeing a broad movement toward multi-die systems for some very good reasons. Early on, some of the advantages were seen in high performance computing (HPC), but now automotive is starting to adopt multi-die systems.
There are other technical motivations such as heterogeneous integration. If you migrate a design to the most advanced process node, do you really need the entire system to be at that three nanometer node? Or do you implement the service functions of your system in a different technology node? Memory access has been another game changer: in the past you had to go through a board to get to memory, whereas with interposers you can get much closer, with much higher bandwidth.
Stacking unleashes a lot of possibilities. It’s not necessarily just memories, but also applications such as image sensors. Instead of taking data in through a straw, eventually you’re getting to the point where data is raining down into your compute die. I think there’s a lot to like about multi-die systems, from a lot of different applications.
What other industry collaborations, IP, and methodologies are required to address the system-level complexity challenge?
There’s a lot of collaboration needed. John just mentioned the partnership that Synopsys has with ANSYS on system analysis. That kind of collaboration is really key. Back in the day, you had manufacturing, design and tooling all under one roof. Then over time, market forces and market efficiencies pulled that apart into different enterprises. But while that’s economics, the nature of the technical problem is still very much intertwined. And if you look across this panel, you see a very tightly connected graph amongst all of us here. A lot of collaboration is needed, and I think that’s pretty remarkable. I don’t know how many other industries have this deep level of collaboration in order to mutually compete, but also to make progress.
You’ll see things like UCIe as a prime example. Standards are just the tip of the iceberg. Underneath that, there’s a whole lot of different collaborations needed to move the needle. More formalization, more standardization. This morning’s keynote called out a need for more standardization around chiplets.
And then with our friends at TSMC, with 3DFabric and 3DBlox, you’re starting to see what we’ve always seen in 2D: the emergence of formalization and alignment between different participants in the ecosystem. So I think it’s vital, and I think we’ve always done it. I’m pretty confident there’s a lot of rich material for collaboration, and we will continue to come up with collaborative solutions.
How are the EDA design flows and the associated IP evolving and where do customers want to see them go?
It’s evolved a lot. It was mentioned earlier that multi-die systems are not new – we started working on them probably 12 years ago. But it’s only recently that the commercial significance and the complexity have grown, evolving from more of a hobbyist type of environment into a professional one. What we’re trying to do is evolve it from the design methods of a few years ago, which basically revolved around assembly – you have components, and you assemble the components together. Now we’re getting into more of a multi-die system type of activity, going from an assembly problem to more of a design automation problem, elevating it to where you’re designing the system together, because the chips are so co-dependent on each other. You can’t design the chiplets in isolation from each other because there’s a host of inter-related dependencies.
Principally where we are as an industry, we’ve invested decades of work into highly complex products and flows, and we don’t want to throw that away, right? You don’t want to disrupt that. You want to ride on top of that and augment it.
Where I see the EDA space going – we will continue to see a lot of the fine-grained optimizations that you would see in a traditional 2D problem space. Where I come from, in place and route, you have a lot of very nice, almost convex problems that are well suited to traditional solution techniques.
However, when you get to the system level, these problems get kind of lumpy, and your solution space can become highly non-convex and difficult to solve with traditional techniques. That’s where, looking into the future, AI and ML and these kinds of things can really help drive it forward.
So design has evolved from manual implementation, to computer-aided design, to electronic design automation, to AI-driven design automation. And probably in the future, instead of computer-aided design, maybe it becomes human-aided design. The AI will tell me, “Hey Henry, I need that spec tightened up by next week. I need you to get that to me.” With the complexity, you really need the automation in order to reasonably build and optimize these systems.
Do you see multi-die system as a significant driver for this technology moving forward?
Yes. Take something like silicon lifecycle management, which is emerging for 2D – if it’s important for 2D, it’s even more so for 3D.
If you look at it from the standpoint of yield, normally you look at 2D dies and there’s the concept of known good die, so you can test the dies before you put them all in. But in a multi-die system, the system yield is the product of the individual yields, right? So even if you have all known good dies, you still have to put them together, and there are multiplicative factors. You can roughly translate that same type of analysis to the overall health of the system as well, which depends on the multiplicative health of the components.
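To make the multiplicative point concrete, here is a quick sketch with invented per-die yields:

```python
# System yield is the product of the individual yields (numbers invented).
die_yields = [0.99, 0.99, 0.98, 0.97]   # per-die (or per-assembly-step) yields

system_yield = 1.0
for y in die_yields:
    system_yield *= y                    # multiplicative compounding

print(f"system yield: {system_yield:.3f}")   # ~0.932, lower than any single die
```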
You have heterogeneous dies with known different properties, different workloads, and different behaviors across your different dies. So it becomes all the more important to be able to keep on top of that in monitoring.
Memories have always played a critical role, both in pushing the envelope on the semiconductor process development front and supporting the varied requirements of different applications and use-cases. The list of the various types of memories in use today runs long. At a gross level, we can classify memories into volatile or non-volatile, read-only or read-write, static or dynamic, etc. And when it comes to the cost, performance, power and area/form factor of an electronic system, a lot rides on the use of all the right memories for the application. The lion’s share of the attention is paid to the effective use of Static Random Access Memories (SRAMs) and Dynamic Random Access Memories (DRAMs) as per the tradeoff benefits to be derived. While the need for higher density memories that consume very low power and perform like SRAMs has always been there, applications were able to manage with a judicious mix of DRAMs and SRAMs.
But over recent years, fast-growing markets such as modems, edge connectivity and Edge AI have started demanding more from memories. Additionally, with the rise of the Smart Internet of Things (IoT) and wearable technology, there is an increasing demand for memory solutions that can provide high performance and low power consumption to extend battery life. These applications want memories that deliver the performance and power benefits of SRAMs (over DRAMs) and the density and cost benefits of DRAMs (over SRAMs) rolled into one. Fortunately, such a type of memory was invented quite a while ago and is called the Pseudo Static Random Access Memory (PSRAM). PSRAM manufacturers were waiting in the wings for adoption drivers such as the above-mentioned fast-growing applications. The list of PSRAM memory suppliers includes AP Memory, Infineon, Micron Technology, Winbond Technology, and others.
What is PSRAM? [Source: JEDEC.org]
(1) A combinational form of a dynamic RAM that incorporates various refresh and control circuits on-chip (e.g., refresh address counter and multiplexer, interval timer, arbiter). These circuits allow the PSRAM operating characteristics to closely resemble those of a SRAM.
(2) A random-access memory whose internal structure is a dynamic memory with refresh control signals generated internally, in the standby mode, so that it can mimic the function of a static memory.
(3) PSRAMs have nonmultiplexed address lines and pinouts similar to SRAMs.
Mobiveil
Mobiveil is a fast-growing technology company that specializes in the development of silicon intellectual property (IP), platforms and solutions for various fast-growing markets. Its strategy is to grow with burgeoning markets by offering customers valuable IP that is easy to integrate into SoCs. One such IP is Mobiveil’s PSRAM Controller, which has been in mass production for more than half a decade with customers across the US, Europe, Israel and China. The controller is available in different system bus flavors, such as AXI and AHB, and supports a variety of PSRAM and HyperRAM devices from many suppliers. The company recently expanded the list with support for AP Memory’s latest 250MHz PSRAM devices.
AP Memory
AP Memory is a world leader in PSRAM and has shipped more than six-billion PSRAM devices to date. The company has positioned itself as a market leader in PSRAM devices, providing a complete product line of high-quality memory solutions to support IoT and wearables market segments. The company continuously launches competitive products and provides customized memory solutions based on customer requirements.
Mobiveil-AP Memory Partnership
This partnership is expected to bring significant benefits for SoCs, as PSRAM devices offer 10x higher density than eSRAM, 10x lower power than standard DRAM, and close to 3x fewer pins. These advantages translate into lower power consumption, higher performance, and cost savings for systems that leverage PSRAMs.
The result of the partnership is a controller IP that provides cost-effective, ultra-low-power memory solutions for system designers. Mobiveil has adapted its PSRAM Controller to interface with AP Memory’s new PSRAM devices, which run at speeds up to 250 MHz in densities from 64Mb to 512Mb, supporting x8/x16 modes. This integration allows SoC designers to take advantage of the high performance of the PSRAM controller at very low power, making it ideal for battery-operated applications and extending the standby time of devices.
The PSRAM controller supports the Octal Serial Peripheral Interface (Xccela standard), enabling speeds of up to 1,000 Mbytes/s for a 16-pin SPI option. Additionally, it provides support for a direct memory-mapped system interface, automatic page boundary handling, linear/wrap/continuous/hybrid burst support, and low-power features like deep and half power down.
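As a sanity check on the quoted bandwidth, a sketch assuming double-data-rate transfers on the 16-bit (x16) option:

```python
# Rough arithmetic behind the quoted 1,000 Mbytes/s figure.
# Assumption: DDR signaling (data on both clock edges), x16 data width.
clock_mhz = 250
data_bits = 16
transfers_per_cycle = 2                  # DDR

mbytes_per_s = clock_mhz * 1e6 * data_bits * transfers_per_cycle / 8 / 1e6
print(f"{mbytes_per_s:.0f} Mbytes/s")    # 1000
```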
Mobiveil’s flexible business models, strong industry presence through strategic alliances and key partnerships, dedicated integration support, and engineering development centers located in Milpitas, CA, Chennai, Bangalore, Hyderabad and Rajkot, India, and sales offices and representatives located worldwide, have added tremendous value to customers in executing their product goals within budget and on time. To learn more, visit www.mobiveil.com.
There were quite a few announcements at the TSMC Technical Symposium last week, but the most important, in my opinion, were based on TSMC N3 tape-outs. Not only is N3 the leading 3nm process, it is the only one in mass production, which is why all of the top-tier semiconductor companies are using it. TSMC N3 will be the most successful node in the history of the TSMC FinFET family, absolutely.
(Graphic: TSMC)
To tape out at 3nm you need IP, and high-speed SerDes IP is critical for HPC applications such as AI, which is now the big driver for leading-edge silicon. Enabling chiplets at 3nm is also a big deal, and that is the focus of this well-worded announcement:
Successful launch of 3nm connectivity silicon brings chiplet-enabled custom silicon platforms to the forefront

Alphawave Semi 3nm Eye Diagram
(Graphic: Business Wire)
LONDON, United Kingdom, and TORONTO, Canada – April 25, 2023 – Alphawave Semi (LSE: AWE), a global leader in high-speed connectivity for the world’s technology infrastructure, today announced the bring-up of its first connectivity silicon platform on TSMC’s most advanced 3nm process with its ZeusCORE Extra-Long-Reach (XLR) 1-112Gbps NRZ/PAM4 serialiser-deserialiser (“SerDes”) IP.
An industry-first live demo of Alphawave Semi’s silicon platform with 112G Ethernet and PCIe 6.0 IP on TSMC 3nm process will be unveiled at the TSMC North America Symposium in Santa Clara, CA on April 26, 2023.
The 3nm process platform is crucial for the development of a new generation of advanced chips needed to cope with the exponential growth in AI-generated data, and enables higher performance, enhanced memory and I/O bandwidth, and reduced power consumption. ZeusCORE XLR Multi-Standard-Serdes (MSS) IP is the highest-performance SerDes in the Alphawave Semi product portfolio, and on the 3nm process it will pave the way for the development of future high-performance AI systems. It is a highly configurable IP that supports all leading-edge NRZ and PAM4 data center standards from 1 to 112 Gbps, supporting diverse protocols such as PCIe Gen1 to Gen6 and 1G/10G/25G/50G/100 Gbps Ethernet.
This flexible and customizable connectivity IP solution, together with Alphawave Semi’s chiplet-enabled custom silicon platform, which includes IO, memory and compute chiplets, allows end-users to produce high-performance silicon specifically tailored to their applications. Customers can benefit from Alphawave Semi’s application-optimized IP subsystems and advanced 2.5D/3D packaging expertise to integrate advanced interfaces such as Compute Express Link (CXL™), Universal Chiplet Interconnect Express™ (UCIe™), High Bandwidth Memory (HBMx), and Low-Power Double Data Rate DRAM (LPDDRx) onto custom chips and chiplets.
“Alphawave Semi continues to see growing demand from our hyperscaler customers for purpose-built silicon with very high-speed connectivity interfaces, fueled by an exponential increase in processing of AI-generated data”, said Mohit Gupta, SVP and GM, Custom Silicon and IP, Alphawave Semi. “We’re engaging our leading customers on chiplet-enabled 3nm custom silicon platforms which include IO, memory, and compute chiplets. Our Virtual Channel Aggregator (VCA) partnership with TSMC has provided invaluable support, and we look forward to accelerating our customers’ high-performance designs on TSMC’s 3nm process.”
About Alphawave Semi
Alphawave Semi is a global leader in high-speed connectivity for the world’s technology infrastructure. Faced with the exponential growth of data, Alphawave Semi’s technology services a critical need: enabling data to travel faster, more reliably and with higher performance at lower power. We are a vertically integrated semiconductor company, and our IP, custom silicon, and connectivity products are deployed by global tier-one customers in data centers, compute, networking, AI, 5G, autonomous vehicles, and storage. Founded in 2017 by an expert technical team with a proven track record in licensing semiconductor IP, our mission is to accelerate the critical data infrastructure at the heart of our digital world. To find out more about Alphawave Semi, visit: awavesemi.com.
I’ve been following Solido since it was an EDA start-up in 2005, through its acquisition by Siemens in 2017. At the recent User2User event there was a presentation by Kwonchil Kang of Samsung Electronics on the topic “ML-enabled Statistical Circuit Verification Methodology using Solido.” High-reliability circuits carry a high-sigma requirement, and 6 sigma equates to 10 failures per 10,135,946,920 samples, or simulations. Using multiple Process, Voltage and Temperature (PVT) corners multiplies the simulation count further. Using a brute-force approach to reach high sigma by Monte Carlo simulation simply takes too much time.
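That oddly precise number is just the one-sided Gaussian tail probability at 6 sigma, easy to reproduce (requires scipy):

```python
# Where "10 failures per 10,135,946,920 samples" comes from.
from scipy.stats import norm

p_fail = norm.sf(6)                      # P(x > 6 sigma), ~9.866e-10
samples_per_10_failures = 10 / p_fail

print(f"tail probability: {p_fail:.4e}")
print(f"samples for 10 failures: {samples_per_10_failures:,.0f}")
# -> 10,135,946,920, matching the figure quoted above
```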
There is a reduced Monte Carlo approach that tries to scale to 6 sigma, but for a bandgap reference circuit example with 36 PVT corners it requires 3,000 simulations per PVT corner, or 108,000 simulations for all 36 corners, and accuracy suffers as long-tail or non-Gaussian characteristics are introduced.
The Solido approach uses Artificial Intelligence (AI) for variation-aware design and verification with Solido Variation Designer, and there are two components:
PVTMC Verifier – finds worst-case corner for target sigma and design sensitivities to variation
High-Sigma Verifier – high-sigma verification 1,000X to 100,000,000X faster than brute-force simulation
There are several steps to the AI algorithm used in the Solido tools:
1. Generate Monte Carlo samples, but don’t simulate them yet.
2. Simulate an initial set of samples.
3. Order all of the remaining samples and simulate them in order.
4. Simulate samples around the target sigma, capturing the true yield.
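A conceptual sketch of this ordered-sampling flow in Python – a paraphrase of the steps above, not Solido’s actual algorithm; simulate and predict_severity are hypothetical stand-ins for a SPICE run and the ML model trained on the initial samples:

```python
import random

def high_sigma_estimate(simulate, predict_severity, n_total, n_initial, n_tail):
    # Step 1: generate Monte Carlo samples, but don't simulate them yet
    samples = [random.gauss(0.0, 1.0) for _ in range(n_total)]
    # Step 2: simulate a small initial set (used to train the model)
    results = {s: simulate(s) for s in samples[:n_initial]}
    # Step 3: order the remaining samples by predicted severity
    remaining = sorted(samples[n_initial:], key=predict_severity, reverse=True)
    # Step 4: simulate worst-first, covering the tail around the target sigma
    for s in remaining[:n_tail]:
        results[s] = simulate(s)
    return results
```

The payoff is that nearly all of the simulation budget lands on the samples that actually determine the tail of the distribution.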
With this Solido AI approach, the resulting Probability Density Function (PDF) looks like the example below:
Probability Density Function
The dashed blue line is the verified PDF fit. Green dots are the initial samples, and dark dots the Monte Carlo results. The orange dots are ordered samples.
For the actual bandgap reference circuit described in the presentation, Solido Variation Designer achieved verification equivalent to 10 billion brute-force simulations in just 24,100 simulations, translating to a speed-up of some 415,000X.
PVTMC Verifier covers all PVT corners and runs Monte Carlo in a way that requires only a few hundred simulations to capture the target sigma, reducing the number of simulations across the corners. The results are accurate because there are no extrapolations or Gaussian assumptions – it uses real simulations at the target sigma. And it covers all PVTs in a single run of the tool.
PVTMC Verifier example results
Inside, PVTMC Verifier identifies ordinal classes for all PVTs, captures a distribution for each class, then verifies distributions within known classes. On the bandgap reference circuit described in the presentation, PVTMC Verifier ran a 6-sigma verification across all 36 PVTs in just 11,000 simulations, a speed-up of 32,000,000X compared to brute-force Monte Carlo.
In the tool flow, a circuit netlist is first run through PVTMC Verifier, which selects the worst-case statistical points, simulates samples at multiple scales, observes the response to a change in scale, and then builds a model to predict the unscaled yield estimate. These first-pass results are then sent to High-Sigma Verifier, which runs initial samples until model building succeeds, uses AI to generate Monte Carlo samples, and then runs tail samples until the result is verified.
Using the Solido AI methodology required only 300 simulations per PVT with PVTMC Verifier (10,800 simulations) plus 24,100 simulations with High-Sigma Verifier, a total of 34,900 simulations. The accuracy matched brute-force Monte Carlo; however, the results completed 10,000,000X faster.
Summary
At Samsung they are using Solido AI technology to achieve their high-sigma verification goals across IC applications, with much shorter run times than brute-force Monte Carlo simulation. They used PVTMC Verifier for first-pass results across all PVTs, then followed with High-Sigma Verifier for final verification on the critical, worst-case PVTs.
MOSFET gate resistance is a very important parameter, determining many characteristics of MOSFETs and CMOS circuits, such as:
• Switching speed
• RC delay
• Fmax – maximum frequency of oscillations
• Gate (thermal) noise
• Series resistance and quality factor in MOS capacitors and varactors
• Switching speed and uniformity in power FETs
• Many other device and circuit characteristics
Many academic and research papers have been written about gate resistance. However, for practical work of IC designers and layout engineers, many important things have not been discussed or explained, for example:
• Is gate resistance handled by SPICE models or by parasitic extraction tools?
• How do parasitic extraction tools handle gate resistance?
• How can one evaluate gate resistance from the layout or from extracted, post-layout netlist?
• How can one identify if gate resistance is limited by the “intrinsic” gate resistance (gate poly), or by gate metallization routing, and what are the most critical layers and polygons?
• Is gate distributed effect (factors of 1/3 and 1/12, for single- and double-contacted poly) captured in IC design flow (in PDK)?
• Is vertical gate resistance component captured in foundry PDKs?
• Should the gate be made wider or narrower, to reduce gate resistance?
• What’s the difference between handling gate resistance in PDKs for RF versus regular MOSFETs or p-cells?
The purpose of this article is to demystify these questions, and to provide some insights for IC design and layout engineers to better understand gate resistance in their designs.
Gate resistance definition and measurement
Gate resistance is an “effective” resistance from the driving point (gate port, or gate driver) to the MOSFET gate instance pin(s) – see Figure 1. (An instance pin is a connection point between a terminal of the SPICE model and the resistive network of a net.)
Figure 1. MOSFET cross-section and schematic illustration of gate resistance.
However, the simplicity of the schematic in Figure 1 can be very misleading. Gate nets can be very large, containing many driving points, many (dozens of) metal and via layers, millions of polygons, and up to millions of gate instance pins (connection points for device SPICE model gate terminals) – see Figure 2.
Figure 2. Schematic illustration of the top-view and cross-sectional view of MOSFET gate network
The gate network forms a large distributed system, with one or several driving points and many destination points.

Very often, a gate net looks and behaves like a huge, regular clock network, distributing the gate voltage to the FETs.

Deriving an equivalent, effective gate resistance for such a large and complex system is not a simple and straightforward task. SPICE circuit simulation does not explicitly report a gate resistance value.
Knowing the value of gate resistance is very useful for estimating switching speed, delay, noise, Fmax, and other characteristics, and for checking whether they are within spec. Also, knowing the contributions to the gate resistance – by layer, and by layout polygon – is very useful to guide layout optimization efforts.
Gate resistance handling by parasitic extraction tools
To understand gate resistance in IC design flow, it’s important to know how parasitic extraction tools treat and model it.
All industry-standard parasitic extraction tools handle gate resistance and its extraction similarly. In layout, the MOS gate structure is represented by a 2D mask traditionally called “poly” – even though the material can be formed by a complex gate metal stack and may have a complex 3D structure.
The tools fracture the poly line at the intersection with the active (diffusion) layer, breaking it into “gate poly” (poly over active) and “field poly” (poly outside active), as shown in Figure 3.
Figure 3. R and RC extraction around MOSFET gate.
Gate poly is also fractured at its center point. The gate instance pin of the MOSFET (SPICE model) is connected to the center point of the gate poly, and the gate poly is described by two parasitic resistors connecting the fracture points. A more accurate model of the gate poly, with two positive and one negative resistor, can be enabled in the PDK, but some foundries prefer not to use it (see the next section on the delta gate model).
Parasitic resistors representing the field poly are connected to the gate contacts or to MEOL (Middle-End-Of-Line) layers and further to upper metal layers.
MOSFET extrinsic parasitic capacitance between the gate poly and the source/drain diffusions and contacts is calculated by parasitic extraction tools and assigned to the nodes of the resistive network. Different extraction tools do this differently – some connect these parasitic capacitances to the center point of the gate poly, while others connect them to the end points of the gate poly resistors. The details of how parasitic capacitance connects to the gate resistor network can have a significant impact on transient and AC response, especially in advanced nodes (16nm and below), where gate parasitic resistance is huge.
These details can be seen in the DSPF file, but are not usually discussed in the open literature or in foundry PDK documentation. Visual inspection of text DSPF files is tedious and requires some expertise. Specialized EDA tools (e.g., ParagonX [3]) can be used to visualize RC network connectivity for post-layout netlists (DSPF, SPEF), probe them (inspect R and C values), perform electrical analysis, and do other useful things.
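For readers who have never opened one, here is a heavily simplified, invented DSPF fragment (illustrative only – exact syntax and annotations vary by extraction tool and PDK) showing the kind of connectivity discussed above:

```
* Simplified, invented DSPF fragment - not from any real PDK
.SUBCKT INV A VSS
*|GROUND_NET VSS
*|NET A 1.5e-15           $ net name and total net capacitance
*|P (A I 0.0 0.0 0.0)     $ port: the driving point of the net
*|I (M1:g M1 g I 0.0)     $ instance pin: gate terminal of device M1
R1 A A:1 12.5             $ field poly / metal resistance
R2 A:1 M1:g 45.0          $ gate poly resistance up to the instance pin
C1 A:1 VSS 8e-16          $ parasitic capacitance at an internal subnode
.ENDS
```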
Delta gate model
The MOSFET gate forms a large distributed RC network along the gate width, as shown in Figure 4.
This distributed network has a different AC and transient response than a simple lumped one-R, one-C circuit. It was shown [2-3] that such an RC network behaves approximately the same as a network with one R and one C element, where C is the total capacitance and R = 1/3 * rsh * W/L for single-side-connected poly, or R = 1/12 * rsh * W/L for double-side-connected poly. These coefficients – 1/3 and 1/12 – effectively enable an accurate reduced-order model for the gate, reducing a large number of R and C elements to two (or three) resistors and one capacitor.
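A minimal sketch of these formulas, using the ~10 Ohm/sq poly sheet resistance mentioned later in this article:

```python
# Effective lumped gate resistance of a distributed poly gate,
# per the 1/3 (single-sided) and 1/12 (double-sided) factors above.
def gate_r_eff(rsh_ohm_sq, width_um, length_um, double_contacted=False):
    r_end_to_end = rsh_ohm_sq * width_um / length_um   # squares * sheet rho
    factor = 1 / 12 if double_contacted else 1 / 3
    return factor * r_end_to_end

# e.g. rsh = 10 Ohm/sq, W = 2 um, L = 0.1 um
print(gate_r_eff(10, 2.0, 0.1))          # ~66.7 Ohm, single-side contacted
print(gate_r_eff(10, 2.0, 0.1, True))    # ~16.7 Ohm, double-side contacted
```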
To enable these coefficients in a standard RC netlist (SPICE netlist or DSPF), some smart folks invented the so-called delta gate model, where the gate is described by two positive resistors and one negative resistor – see Figure 5.
Figure 5. MOSFET Delta gate model.
Some SPICE simulators have problems handling negative resistors, which is possibly why this model did not gain wide adoption. Some foundries and PDKs support the delta gate model, while others don’t.

Many people are surprised when they see negative resistors in DSPF files. If these resistors are next to the gate instance pin, they are part of the delta gate circuit.
Distributed effects along the gate length (in the direction from source to drain) are usually ignored at the circuit analysis level, due to the small gate length compared to the gate width.
Impact of interconnect parasitics on gate resistance
In “old” technologies, metal interconnects (metals and vias) had very low resistance, and gate resistance was dominated by the gate poly. The analysis and calculation of gate resistance was very simple.

In the latest technologies (e.g. 16nm and below), interconnects have very high resistance and can contribute a significant fraction (50% or more) of the total gate resistance. Depending on the layout, gate resistance may have significant contributions from any layer – devices (gate poly, field poly), MEOL, or BEOL.
Figure 6 shows the results of gate resistance simulation using ParagonX [3]. A Pareto chart of resistance contributions by layer helps identify the most important layers for gate resistance. Visualization of the contributions of layout polygons to the gate resistance immediately points to the choke points – the bottlenecks for gate resistance – which is very useful to guide layout optimization efforts.
Figure 6. Simulation results of gate resistance: (a) Gate resistance contribution by layer, and (b) contribution by polygons shown by color over the layout.
Gate resistance in FinFETs
In planar MOSFETs, the gate has a very simple planar structure, and the current flow in the gate is one-dimensional, along the direction of the gate width.
In FinFET technologies, the gate wraps around very tall silicon fins, and hence has a very complicated 3D structure. Further, the gate material is selected based on its work function, to tune the threshold voltage (threshold voltage in FinFETs is tuned not by channel doping, but by gate materials). These materials have very high resistance, much higher than silicided poly (which has a typical sheet resistance of ~10 Ohm/sq). The gate may be formed by multiple layers – an interface layer with the silicon, and one or more layers above it.
However, all these details are abstracted from the IC designers and layout engineers, and they see usual polygons for “poly” and for “active” – which makes design work much easier.
Handshake between SPICE model and parasitic extraction
In general, both SPICE models and parasitic extraction tools take gate resistance into account. Parasitic extraction is considered a more accurate method of calculating parasitic R and C values around the devices, since it “knows” (unlike SPICE) about the layout.
To avoid double-counting parasitic resistance and capacitance (once in the SPICE model and again in parasitic extraction), there is a handshake mechanism between SPICE modeling and parasitic extraction, based on special instance parameters.
Regular device vs RF Pcell compact models
Regular MOSFET SPICE models do not describe gate resistance accurately enough for high frequencies, high switching speeds, or RF and noise performance. To enable high simulation accuracy, foundries usually recommend using RF P-cells, which have fixed sizes, contain shielding (guard rings and metal cages), and are described by high-accuracy models derived from measurements. However, these RF P-cells have a much larger area than standard MOSFETs, and many designers prefer standard MOSFETs to reduce area.
Vertical component of gate resistance
In “old” technologies (pre-16nm), gate resistance was dominated by the lateral resistance. In advanced technologies, however, multiple interfaces between gate material layers lead to a large vertical gate resistance. This resistance is inversely proportional to the area of the gate poly. It can be modeled as an additional resistor connecting the gate instance pin to the center point of the gate poly – see Figure 7(a). As a result, as the gate gets narrower (fewer fins), gate resistance first goes down, but then increases at very small gate widths, displaying the characteristic non-monotonic behavior seen in Figure 7(b). The old rule of thumb that “the narrower gate has lower gate resistance” does not work anymore. Designers and layout engineers have to select the optimum (non-minimal) gate width (number of fins) to minimize gate resistance.
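A toy model (all numbers invented) reproduces this non-monotonic behavior – the lateral term grows with gate width W, while the vertical term falls as the gate poly area W*L grows:

```python
# Toy model of lateral + vertical gate resistance vs. gate width W.
rsh, L = 10.0, 0.02        # sheet resistance (Ohm/sq), gate length (um)
rho_vert = 0.5             # vertical resistance term (Ohm*um^2), hypothetical

def gate_r_total(W_um):
    r_lateral = (1 / 12) * rsh * W_um / L    # double-side-contacted gate
    r_vertical = rho_vert / (W_um * L)       # inversely proportional to area
    return r_lateral + r_vertical

for W in [0.1, 0.2, 0.5, 1.0, 2.0]:
    print(f"W={W:4.1f} um  Rg={gate_r_total(W):7.1f} Ohm")
# As W shrinks, Rg first falls, then rises again at very small widths;
# the minimum sits at an intermediate (non-minimal) gate width, as in Fig. 7(b).
```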
Figure 7. (a) Gate model accounting for vertical gate resistance, and (b) measured and simulated gate resistance versus number of fins (ref. [2]).

Depending on the technology, the PDK, and the foundry, the vertical gate resistance may or may not be included in parasitic extraction. It’s very easy to check this in the DSPF file: if the gate instance pin is connected directly to the center of the gate poly, vertical resistance is not accounted for; if it is connected by a positive resistor to the center of the gate poly, that resistor represents the vertical gate resistance.
Technology trends
With technology scaling, both gate resistance and interconnect resistance increase significantly – by up to one or two orders of magnitude. As a result, layout details that were not important for gate resistance in older nodes become very important in advanced nodes.
Other MOSFET gate-like structures
While the discussion in this article focuses on gate resistance in MOSFETs, the same arguments and approaches apply to other distributed systems controlled by a gate or gate-like structure, such as:
• IGBTs (Insulated Gate Bipolar Transistors)
• Decoupling capacitors
• MOS capacitors
• Varactors
• Deep trench and other MIM-like integrated capacitors
Figure 8 shows a gate structure of a vertical MOSFET, and gate delay distribution over the device area, simulated using ParagonX [3].
Figure 8. (a) Typical layout of vertical FET, IGBT, and other gate-controlled devices. (b) Distribution of gate resistance and delay over area.
References
1. B. Razavi et al., “Impact of distributed gate resistance on the performance of MOS devices,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 41, pp. 750-754, Nov. 1994.
2. A. J. Scholten et al., “FinFET compact modelling for analogue and RF applications,” IEDM 2010, p. 190.
3. ParagonX User Guide, Diakopto Inc., 2023.
Performance, Power and Area (PPA) metrics are the driving force in the semiconductor market and impact all electronic products that are developed. PPA tradeoff decisions are not engineering decisions, but rather business decisions made by product companies as they decide to enter target end markets. As such, the sooner a company knows whether a certain PPA can be achieved, the better for business planning alongside chip development work. The worst outcome is for a company to realize many months into the development phase that the desired PPA cannot be achieved. Hence, companies seek to establish optimal PPA for a chip as early as possible in the development process. Placement happens to be such a stage, as the physical elements have more or less been finalized by then. But achieving that goal is not simple, given the several PPA-related challenges that need to be addressed to achieve a successful design.
What if there were a way to achieve optimal PPA at the placement stage and carry it through to signoff, in spite of the above-mentioned challenges? Siemens EDA’s digital implementation solution, Aprisa, promises to deliver that benefit and more. This was substantiated by MaxLinear, one of Siemens EDA’s customers, during the recent User2User conference. Ravi Ranjan, MaxLinear’s Director of Physical Design, presented proof points based on real-life experience with N16 and N5 process-based design implementations. MaxLinear plans to extend its adoption of Aprisa to new process nodes on future projects.
Excellent Correlation
One common reason for changing placement after routing is to fix timing violations: the design must meet the required timing constraints while minimizing delay and maximizing clock frequency, which requires careful optimization of placement and routing so that the critical paths are optimized and the timing constraints are met. Another reason is to optimize power consumption: the design must minimize power while still meeting performance requirements, calling for techniques such as clock gating, power gating, and voltage scaling. Yet another reason is routing congestion: too many wires or interconnects routed through a limited space result in routing difficulties or a suboptimal routing solution.
Closure is the process of meeting all design requirements, such as timing, power, and area, while also ensuring that the design is manufacturable.
Excellent correlation of timing, latency and skew through the placement, clock tree synthesis (CTS) and routing stages indicates that placement-stage PPA carries through to successful routing. As an example, the following figure shows a pre-route vs. post-route signal net length and clock net length correlation sample from an N16-based design.
Automated Flow Setups
Place and route (P&R) tools and methodologies typically need to be adapted for each new technology node to achieve the best PPA for a target process. The reason for this is that each process technology has unique characteristics that can significantly impact the P&R process. To achieve the best results, P&R tools need to be specifically calibrated and optimized for a process node. Typically, this step calls for engineering expertise and prior experience and involves trial-and-error for adoption.
Aprisa’s FlowGen capability reduces the effort needed to set up for a new technology and adapt the design flow. MaxLinear found it very easy to adapt their flow when moving from N16 to N5 designs. Aprisa FlowGen supports a wide range of design types, including SoC, CPU, timing-critical and congestion-critical designs.
Summary of Aprisa Benefits
Placement stage optimal PPA maintained through to signoff
Anirudh is an engaging speaker with a passion for technology. Acknowledging the sign of the times, he sees significant value-add in AI but reminded us that it is still a supporting actor in system design and other applications, where the star roles will continue to be played by computational software founded in hard science, math, and engineering. This is Cadence’s singular focus – continuing to advance computational software methods in EDA and other domains while leveraging AI techniques where appropriate.
Market drivers
I was talking to an analyst recently who thought that because manufacturing activity is down, semiconductor design must also be suffering. So, I’m not surprised that Anirudh kicked off the discussion with a nod to this being a tough year for semi revenues following multiple years of massive growth, attributing the correction to over-stuffed inventories.
And yet (maybe someday analysts will understand this), design continues to be strong, in part because design cycles are much longer than manufacturing cycles, in part because no one can afford to come out of a downturn without new products ready to launch, and in part because design starts continue to grow as systems companies, now delivering 45% of Cadence business, are accelerating their own design activity.
Anirudh believes that the semi-industry will reach $1 trillion in revenue by the end of the decade, a 2X growth from today, and that electronics and related systems will reach $3 trillion in revenue in a similar period. Manufacturing may be in a slump right now, but we already have an appetite for hyperscale computing, 5G, autonomous vehicles, AI and industrial IoT. That appetite won’t disappear, so manufacturing will surge back at some point. The winners at that point will be companies ready with new designs. Cadence is very optimistic about the long-term tailwinds behind design for such products.
Computational software and AI
Computational software is all about precision at massive scale – in the complexity of the object being analyzed (billions, soon trillions, of transistors) and in the nature of the analysis (PPA optimization and/or multi-physics). Foundational methods are grounded in well-established hard science: finite element analysis with origins in the late 1800s, Maxwell’s electromagnetic equations from a similar period, thermal diffusion first described by Fourier even earlier, and so on. The EDA industry has been developing these technologies for at least the last 50 years.
In contrast, AI is all about probabilistic inferencing, delivering impressive responses with, say, 97% certainty, in some cases better than we can manage. But, at the same time, we don’t want to hear that in 97% of cases our car won’t crash, or the robot surgeon won’t make a mistake. We want the precision and reliability of computational software in building and analyzing systems with AI as a layer on top to help us explore more implementation options.
Developing and maintaining that technology is not cheap. Cadence has about 10,000 people, with 65% in R&D and 25% in customer support – 90% of its staff in engineering, which is comforting to know. Designs built using these technologies will be reliable, safe, secure, and eco-friendly. But how does that scale? By 2030, designs are expected to grow by at least 10X in transistor count. Technology companies are already struggling to add staff, and none can afford to grow staffing by 10X. We need to become even more efficient by abstracting architecture design to higher levels, parallelizing even more, and relying more on AI-assisted decision-making.
AI in Cadence products
Reinforcement learning has become a dominant technique for optimization in EDA. One significant advantage is that it doesn’t require gradient-based estimation to find good search directions. Gradient methods work well when optimization metrics vary relatively smoothly and can be computed quickly, but not when they vary rapidly or take hours to compute on each change. Cadence has been talking over the past couple of years about advances in AI with products like Cadence Cerebrus, Verisium and Optimality, which all use reinforcement learning over multiple runs to guide optimization. These are all cases where computing metrics precisely may take hours, making reinforcement learning essential to advance optimization options.
Evidently results are impressive, as judged by numbers Anirudh shared. There have already been 180+ Cadence Cerebrus tapeouts.
Last month, Cadence announced Virtuoso Studio, covered in more detail by my colleague Daniel Payne. Briefly, it offers more place-and-route support in analog and a claimed 3X productivity advance for designers. There’s more support for heterogeneous integration in 2.5D/3D packaging, adding analog and RF into the same package. It also includes integrations with the digital design tool suite, integrity analysis, multi-die packaging, AWR analysis and multi-physics analysis for thermal, electromagnetics, etc.
Cadence also recently announced Allegro X AI for PCB and package design, which automates placement and routing and reduces physical layout and analysis challenges. For 3D-IC, Cadence offers Integrity – development started in Allegro back in 1995, long before most of us had even heard of chiplets (remember system-in-package and modules?). Around 2015, more capabilities were developed, though the industry was still not quite ready. More recently, Cadence has been working very closely with its foundry partners to refine Integrity support, leading to their latest AI-driven 3D-IC solution.
Onward and upward
It’s easy to see the computational software focus in everything I described above – from chip design and analysis to package and system design and analysis. Where is Cadence going with some of its recent acquisitions? In May of last year, Cadence announced a partnership with the McLaren Formula 1 team, which is looking to Cadence’s Fidelity CFD software to optimize aerodynamics for its race cars.
Cadence acquired Future Facilities about a year ago; it provides electronics cooling analysis and energy performance optimization solutions for data center design and operations using physics-based 3D digital twins. And proving it is even more versatile, just a few days ago Cadence announced a partnership with the 49ers to evaluate ways to optimize energy efficiency and sustainability at Levi’s Stadium.
Last year, Cadence also made an investment in molecular sciences company, OpenEye Scientific. Anirudh is very excited about this, seeing huge synergy in simulating molecules. He sees (of course) significant similarities between OpenEye simulation and the Cadence Spectre platform with physics models for molecules looking rather like BSIM models for circuit simulation!
Energizing stuff. I look forward to next year’s update.