Masks have always been an essential part of the lithography process in the semiconductor industry. With the smallest printed features already being subwavelength for both DUV and EUV cases at the bleeding edge, mask patterns play a more crucial role than ever. Moreover, in the case of EUV lithography, throughput is a concern, so the efficiency of projecting light from the mask to the wafer needs to be maximized.
Conventional Manhattan features (named after the Manhattan street blocks or the lit building windows in the evening) are known for their sharp corners, which naturally scatter light outside the numerical aperture of the optical system. To minimize such scattering, one may turn to Inverse Lithography Technology (ILT), which allows curvilinear feature edges on the mask to replace sharp corners. To give the simplest example where this may be useful, consider the target optical image (or aerial image) at the wafer in Figure 1, which is expected from a dense contact array with quadrupole or QUASAR illumination, resulting in a 4-beam interference pattern.
Figure 1. A dense contact image from quadrupole or QUASAR illumination, resulting in a four-beam interference pattern.
Four interfering beams cannot produce sharp corners at the wafer, only somewhat rounded ones (derived from sinusoidal terms). A sharp feature corner on the mask would produce the same roundness, but with less light arriving at the wafer; a good portion of the light is scattered out. A more efficient transfer of light to the wafer can be achieved if the mask feature has a curvilinear edge with the same roundness, as in Figure 2.
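The four-beam case is easy to model numerically. In the quadrupole geometry, the four plane waves sum to an amplitude proportional to cos(kx)·cos(ky), so the aerial-image intensity near each contact is inherently rounded. A minimal sketch (the pitch value is illustrative):

```python
import numpy as np

# Four plane waves at (+/-k, +/-k) -- the quadrupole case -- sum to an
# amplitude proportional to cos(k*x) * cos(k*y); intensity is its square.
pitch = 100.0                 # contact pitch in nm (illustrative value)
k = np.pi / pitch             # spatial frequency of the interference fringes

x = np.linspace(-pitch, pitch, 201)
X, Y = np.meshgrid(x, x)
intensity = (np.cos(k * X) * np.cos(k * Y)) ** 2   # normalized aerial image

# Intensity peaks at the contact centers; the iso-intensity contours near
# each peak are rounded, never sharp-cornered, no matter the mask shape.
print(intensity.max())
```

The sinusoidal form makes the article's point directly: no mask pattern can add corner sharpness beyond what these four beams can reconstruct, so a sharp mask corner only wastes light.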
Figure 2. Mask feature showing curvilinear edge similar to the image at the wafer shown in Figure 1. The edge roundness ideally should be the same.
With curvilinear edges, the amount of light scattered out can ideally be reduced to zero. Yet despite this advantage, it has been difficult to make masks with such features, as curvilinear edges require more mask writer information to be stored than Manhattan features, reducing system throughput through the extra processing time. The data volume required to represent curvilinear shapes can be an order of magnitude more than for the corresponding Manhattan shapes. Multi-beam mask writers, which have only recently become available, compensate for the loss of throughput.
Mask synthesis (designing the features on the mask) and mask data prep (converting the said features to the data directly used by the mask writer) also need to be updated to accommodate curvilinear features. Synopsys recently described the results of its curvilinear upgrade. Two highlighted features for mask synthesis are Machine Learning and Parametric Curve OPC. Machine learning is used to train a continuous deep learning model on selected clips. Parametric Curve OPC represents curvilinear layer output as a sequence of parametric curve shapes, in order to minimize data volume. Mask data prep comprises four parts: Mask Error Correction (MEC), Pattern Matching, Mask Rule Check (MRC), and Fracture. MEC is supposed to compensate for errors from the mask writing process, such as electron scattering from the EUV multilayer. Pattern matching operations search for matching shapes and become more complicated without restrictions to only 90-deg and 45-deg edges. Likewise, MRC needs new rules to detect violations involving curved shapes. Finally, fracture needs to not only preserve curved edges but also support multi-beam mask writers.
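The data-volume benefit of Parametric Curve OPC can be illustrated with a back-of-the-envelope comparison between a piecewise-linear edge and a parametric curve (the tolerance, radius, and byte sizes below are illustrative assumptions, not Synopsys figures):

```python
import math

radius = 50.0        # nm, illustrative corner radius
tolerance = 0.1      # nm, allowed deviation from the true arc

# Chord-sagitta bound: a chord spanning arc angle t deviates from the arc
# by radius * (1 - cos(t / 2)); solve for the largest usable angle.
t = 2 * math.acos(1 - tolerance / radius)
segments = math.ceil((math.pi / 2) / t)      # chords for a quarter-circle edge
vertex_bytes = (segments + 1) * 2 * 4        # (x, y) vertices at 4 bytes each

# A single cubic Bezier approximates a quarter circle to ~0.03% of the
# radius using just 4 control points.
bezier_bytes = 4 * 2 * 4

print(segments, vertex_bytes, bezier_bytes)
```

Even for this gentle curve, the vertex list needs several times the storage of the parametric form, and the gap widens as tolerance tightens, which is the motivation for curve-based output formats.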
“Strategy” is a word sometimes used loosely to lend an aura of visionary thinking, but in this context, it has a very concrete meaning. Without a strategy, you may be stuck with decisions you made on a first-generation design when implementing follow-on designs, or face major rework to correct issues you hadn’t foreseen. Making optimum architecture decisions for the series at the outset is key. Will it support replicating a major subsystem allowing more channels in premium versions, for more sensors or more video streams? Can the memory subsystem scale to support increased demand? Careful planning and modeling, checking target bandwidths and latencies, is a necessary starting point. However, architectural feasibility alone may not be sufficient to ensure scalability for one critical component – the interconnect between the function blocks in the design.
Strategies and risks for interconnect
The startup strategy. Starting with no design infrastructure, part of your funding must be committed to design tools and essential IP. Some CPU cores come with low-cost access to an interconnect generator based on a crossbar technology. Or perhaps you decide to build your own generator – how hard can that be?
This strategy may work well on the first-generation design. Crossbar-based interconnect is well-established for entry-level designs but exhibits a glaring scalability weakness as systems become more complex. Area consumed by interconnect grows rapidly as the number of initiators and targets grows, creating more challenges for bandwidth, latencies and layout congestion. Problems become acute in follow-on designs as target and initiator counts increase to merge multiple market demands into a common product. Designs must also be as robust as possible to IP changes. A home-grown bus fabric may have worked well with the IP portfolio for the launch design, but what if one IP fails to measure up in the next product? A workaround may be possible but would kill your margins. A better IP is available but only with an interface you don’t yet support. Designing and fully verifying a new protocol will take more time than you have in the critical path to product release.
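To make the scaling argument concrete, here is a deliberately simplified model (the NoC side, with its router port count and per-router link factor, is an illustrative assumption, not any vendor's topology):

```python
# Rough wiring-growth comparison (illustrative model, not vendor data):
# a full crossbar needs a path per initiator-target pair, O(N*M), while a
# NoC routes traffic through shared links, growing roughly linearly.
def crossbar_links(initiators, targets):
    return initiators * targets

def noc_links(initiators, targets, ports_per_router=4):
    # Assume a simple mesh-style NoC: one link per endpoint plus
    # router-to-router links proportional to the router count.
    endpoints = initiators + targets
    routers = -(-endpoints // ports_per_router)   # ceiling division
    return endpoints + 2 * routers

for n in (4, 16, 64):
    print(n, crossbar_links(n, n), noc_links(n, n))
```

At 4 initiators and 4 targets the two are comparable; at 64 of each, the crossbar's quadratic growth dominates, which is the layout-congestion problem described above.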
If you are going to use a crossbar interconnect in your first-pass design, set clear expectations that this will be a proof-of-concept build. It is already widely accepted that scalable interconnect must be based on NoC technology; to transition to a scalable market product, it is almost certain you will have to redesign around that technology. Commercial NoC IP generators already support the full range of AMBA and other protocol standards, limiting risk if you need to change IP. Then again, you could just start with a NoC, avoiding the later risks.
The “What we have works and change adds risk” strategy.
Risk in change is an understandable concern but must be balanced against other risks. If it was tough to close timing on your last design and your next design will be more complex, you may be able to battle through and make it work, but at what cost? Pride in surviving the challenge will dissipate quickly if PPA is compromised.
This is not a hypothetical concern. One large company planned to reduce total system cost by designing out a chip they were buying externally. They already had all the tooling and expertise needed to make this happen. The plan seemed like such a no-brainer that they built this expectation into forward projections to analysts – improved margins at more competitive pricing. But they couldn’t close timing at target PPA on their in-house replacement. To continue to deliver the larger system, they were forced to extend their contract with the existing external supplier, missing projections and getting a black eye. For the next generation, they switched to a commercial NoC solution and were able to complete the design-out successfully.
The “Our interconnect is differentiating” strategy.
There are a few system architectures for which interconnect architecture must be quite special, commonly for mesh networks or more exotic topologies like a torus. Applications demanding such topologies are typically high-premium multi-core server systems, GPUs and AI training systems. Even here, commercial NoC generators have caught up, to the point that market-leading AI systems companies now routinely use these NoCs – suggesting that, fundamentally, differentiation even in these high-end designs is not in the NoC. Just as for other IP, the trend is to commercial solutions for all the usual reasons: maybe initially only comparable to the in-house option, but proven across an industry-wide range of SoCs, continually enhanced to remain competitive, with lower total cost of ownership, always-on support and resilience to expert staff turnover.
In a challenging economic climate, it has become even more important for us to pick our strategic battles carefully. People who work on NoC design are often among the best designers in the company. Where is the best place to use those designers? In further securing your lead in truly differentiating features, or in continuing to support NoC technology you can buy off-the-shelf?
If these arguments pique your interest, take a look at Arteris’ FlexNoC and Ncore Cache Coherent interconnect IPs. They boast over 3 billion Arteris-based SoCs shipped to date across a wide range of applications.
DSP and AI are generally considered separate disciplines with different application solutions. In their early stages (before programmable processors), DSP implementations were discrete, built around a digital multiplier-accumulator (MAC). AI inference implementations also build on a MAC as their primitive. If the interconnect were programmable, could the MAC-based hardware be the same for both and still be efficient? Flex Logix says yes with their next-generation InferX reconfigurable DSP and AI IP.
Blocking-up tensors with a less complex interconnect
If your first thought reading that intro was, “FPGAs already do that,” you’re not alone. When tearing into something like an AMD Versal, one sees AI engines, DSP engines, and a programmable network on chip. But there’s also a lot of other stuff, making it a big, expensive, power-hungry chip that can only go in a limited number of places able to support its needs.
And, particularly in DSP applications, the full reconfigurability of an FPGA isn’t needed. Having large numbers of routable MACs sounds like a good idea, but configuring them together dumps massive overhead into the interconnect structure. A traditional FPGA looks like 80% interconnect and 20% logic, a point most simplified block diagrams gloss over.
Flex Logix CEO Geoff Tate credits his co-founder and CTO Cheng Wang with taking a fresh look at the problem. On one side are these powerful but massive FPGAs. On the other side sit DSP IP blocks from competitors that don’t pivot from their optimized MAC pipeline to sporadic AI workloads with vastly wider and often deeper MAC fields organized in layers.
Wang’s idea: create a next-generation InferX 2.5 tile built around tensor processor units (TPUs), each with eight blocks of 64 MACs (INT8 x INT8) tied to memory and a more efficient eFPGA-based interconnect. With 512 MACs per TPU and 8192 MACs per tile, each tile delivers 16 TOPS peak at 1 GHz. It flips the percentages: 80% of the InferX 2.5 unit is hardwired, yet it retains 100% reconfigurability. One tile in TSMC 5nm is a bit more than 5mm2, a 3x to 5x improvement over competitive DSP cores for equivalent DSP throughputs.
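As a sanity check, the quoted peak-throughput numbers follow from the MAC counts, counting one MAC as two operations (the INT16 ratio below is my inference from the 4 TOPS figure Flex Logix quotes for INT16, not a vendor statement):

```python
# Checking the quoted peak-throughput arithmetic (one MAC = 2 ops).
macs_per_block = 64
blocks_per_tpu = 8
macs_per_tpu = macs_per_block * blocks_per_tpu          # 512 MACs per TPU
tpus_per_tile = 16
macs_per_tile = macs_per_tpu * tpus_per_tile            # 8192 MACs per tile
clock_hz = 1e9                                          # 1 GHz

int8_tops = macs_per_tile * 2 * clock_hz / 1e12         # ~16.4 TOPS peak

# Assumption: four INT8 MACs combine into one INT16 MAC, matching the
# quoted 4 TOPS (2 TeraMACs/sec) INT16 figure.
int16_macs = macs_per_tile // 4
int16_tops = int16_macs * 2 * clock_hz / 1e12           # ~4.1 TOPS

print(int8_tops, int16_tops)
```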
Software makes reconfigurable DSP and AI IP work
The above tile is the same for either DSP or AI applications – configuration happens in software.
The required DSP operations for a project are usually close to being locked down before committing to hardware. InferX 2.5, with its software, can handle any function: FFT, FIR, IIR, Kalman filtering, matrix math, and more, at INT 16×16 or INT 16×8 precision. One tile delivers 4 TOPS (INT16 x INT16), or in DSP lingo 2 TeraMACs/sec, at 1 GHz. Flex Logix provides a code library that handles softlogic and function APIs, simplifying application development. Another footprint-saving feature: an InferX 2.5 tile can be reconfigured in less than 3usec, enabling a quick function change for the next pipeline step.
AI configurations use the same tile with different Flex Logix software. INT 8 precision is usually enough for edge AI inference, meaning a single tile and its 16 tensor units push 16 TOPS at 1 GHz. The 3usec reconfigurability allows layers or even entire models to switch processing instantly. Flex Logix AI quantization, compilers, and softlogic handle the mapping for models in PyTorch, TensorFlow Lite, or ONNX, so application developers don’t need to know hardware details to get up and running. And, with the reconfigurability, teams don’t need to commit to an inference model until ready and can change models as often as required during a project.
Scalability comes with multiple tiles. N tiles provide N times the performance in DSP or AI applications, and tiles can run functions independently for more flexibility. Tate says so far, customers have not required more than eight tiles for their needs, and points out larger arrays are possible. Tiles can also be power managed – below, an InferX 2.5 configuration has four powered tiles and four managed tiles that can be powered down to save energy.
Ready to deliver more performance within SoC power and area limits
Stacking InferX 2.5 up against today’s NVIDIA baseline provides additional insight. Two InferX 2.5 tiles in an SoC check in around 10mm2 and less than 5W – and deliver the same Yolo v5 performance as a much larger external 60W Orin AGX. Putting this in perspective, below is super-resolution Yolo v5L6 running on an SoC with InferX 2.5.
Tate says what he hears in customer discussions is that transformer models are coming – maybe displacing convolutional and recurrent neural networks (CNNs and RNNs). At the same time, AI inference is moving into SoCs with other integrated capabilities. Uncertainty around models is high, while area and power requirements for edge AI have finite boundaries. InferX 2.5 can run any model, including transformer models, efficiently.
Whether the need is DSP or AI, InferX is ready for the performance, power, and area challenge. For more on the InferX 2.5 reconfigurable DSP and AI IP story, please see the following:
SoC test challenges arise due to the complexity and diversity of the functional blocks integrated into the chip. As SoCs become more complex, it becomes increasingly difficult to access all of the functional blocks within the chip for testing. SoCs also can contain billions of transistors, making it extremely time-consuming to test chips. As test time directly impacts test cost, minimizing test time is critical to managing the cost of a finished product. Automatic Test Pattern Generation (ATPG) is a crucial part of SoC testing, as it generates test patterns to detect faults in the design. However, the automation of ATPG is a challenging task, especially for complex SoCs, due to the large number of functional blocks and test points that need to be covered. Developing efficient and effective ATPG algorithms is a key challenge for SoC testing. But many of the ATPG tools today are not fully automated. Users have to learn all the commands and the options offered by the tools in order to use them effectively.
Is there a solution that brings some automation to the ATPG process, thereby enhancing engineering productivity? What if this solution also delivers significant savings in test time? Siemens EDA’s Tessent Streaming Scan Network (SSN) solution promises to deliver these benefits. This was substantiated by Intel, one of Siemens EDA’s customers during the recent User2User conference. Intel’s Toai Vo presented proof points based on his team’s experience with their first design using Tessent SSN solution. His team included Kevin Li, Joe Chou and Chienkuo (Tom) Woo.
Tessent SSN Solution
In a standard scan testing approach, test data is loaded into the circuit one bit at a time and shifted through the scan chains to observe the output responses. This process is repeated for each test pattern, which can lead to long test times. The Tessent SSN solution instead packetizes test data to dramatically reduce DFT implementation effort and manufacturing test times. By decoupling core-level and chip-level DFT requirements, each core can be designed with the optimal compression configuration for that core. This solution can be used to efficiently test large and complex chips that have a high number of internal nodes. It uses a dedicated network to transmit test data in a streaming manner, enabling parallel processing of the data and thereby reducing test time.
Scalability
The Streaming Scan Network supports scalable scan architectures that can handle SoCs with a large number of functional blocks. The tool provides a scalable approach for testing any number of cores concurrently while minimizing test time and scan data volume. The Tessent SSN test infrastructure is built around the IEEE 1687/IJTAG standard, delivering greater flexibility and scalability to handle more complex designs and test scenarios.
Automation
The hierarchical, object-oriented nature of the test infrastructure lends itself to easier automation. Using the Tessent infrastructure, a user can easily insert test logic into a chip. The process begins with the RTL design, where the SSN test logic is inserted using automation.
Test Time Savings
Using a traditional ATPG approach, normally only one block can be run at a time, which extends total test time. With the Tessent SSN ATPG approach, multiple blocks can be run in parallel, greatly reducing the total test time. The following table shows the test time savings achieved by Toai’s team on their design.
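As a rough illustration of where the savings come from, consider a toy model (block times and overhead are made-up numbers, not Intel's data):

```python
# Toy model of why packetized, concurrent scan cuts test time:
# sequential testing sums per-block times, while concurrent streaming is
# bounded by the slowest block plus some bandwidth-sharing overhead.
block_test_times_ms = [120, 95, 80, 150, 60]   # illustrative per-block times

sequential_ms = sum(block_test_times_ms)       # one block at a time

def concurrent_ms(times, overhead=1.2):
    # Assume an SSN-style network streams all blocks in parallel, with a
    # 20% overhead for sharing the packetized delivery bandwidth.
    return max(times) * overhead

print(sequential_ms, concurrent_ms(block_test_times_ms))
```

The more blocks a design has, the more the sequential sum grows while the concurrent bound stays pinned near the slowest block, which is why savings increase with SoC complexity.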
Summary
Toai’s team found it very easy to migrate from a traditional embedded deterministic testing (EDT) channel-based ATPG to a packet-based ATPG with SSN. The Tessent SSN solution greatly reduced engineering effort and silicon bring-up time. And the test time reduction was significant compared to a traditional test solution. In Toai Vo’s words, it is absolutely an innovative test solution and it really works.
Dan is joined by Chris Morrison. Chris has 15 years’ experience delivering innovative analog, digital, power management and audio solutions for international electronics companies, and developing strong relationships with key partners across the semiconductor industry. Currently he is the Director of Product Marketing at Agile Analog, the analog IP innovators. Previously he held engineering positions, including 10 years at Dialog Semiconductor, since acquired by Renesas.
Chris details some of the new developments at Agile Analog, including foundry ecosystem expansion and new product introductions that are coming. Chris also explains the details behind how Agile Analog puts a digital “wrapper” around analog IP subsystems. The benefits of this approach for AMS integration are detailed, along with information about the targeted, customized delivery methodology used by Agile Analog.
The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.
At the recent Synopsys Users Group Meeting (SNUG) I had the honor of leading a panel of experts on the topic of chiplets. One of those panelists was the very personable Dr. Henry Sheng, Group Director of R&D in the EDA Group at Synopsys. Henry currently leads engineering for 3DIC, advanced technology and visualization.
Are we seeing other markets move in this direction?
We’re seeing a broad movement for multi-die systems for some very good reasons. Early on some of the advantages were seen in the area of high performance computing (HPC) but now automotive is starting to adopt multi-die systems.
There are other technical motivations such as heterogeneous integration. If you migrate a design to the most advanced process node, do you really need the entire system to be at that three nanometer node? Or do you implement the service functions of your system with a different technology node? Memory access has been another game changer: in the past you had to go through a board to get to the memory, and then with interposers you can get much closer and with much higher bandwidth.
Stacking unleashes a lot of possibilities. It’s not necessarily just memories, but also applications such as image sensors. Instead of taking data through a straw, eventually you’re getting to the point where it is raining down data into your compute die. I think there’s a lot to like about multi-die systems, from a lot of different applications.
What other industry collaborations, IP, and methodologies are required to address the system-level complexity challenge?
There’s a lot of collaboration needed. John just mentioned the partnership that Synopsys has with ANSYS on system analysis. That kind of collaboration is really key. Back in the day, you had manufacturing, design and tooling all under one roof. And then over time, market forces and market efficiencies pulled that apart into different enterprises. But while that’s economics, the nature of the technical problem is still very much intertwined. And if you look across this panel, you see a very tightly connected graph amongst all of us here. There’s a lot of collaboration that’s needed. And I think that’s pretty remarkable. I don’t know how many other industries have this deep a level of collaboration in order to mutually compete, but also to make progress.
You’ll see things like UCIe as a prime example. Standards are just the tip of the iceberg. Underneath that, there’s a whole lot of different collaborations needed to move the needle. More formalization, more standardization. This morning’s keynote called out a need for more standardization around chiplets.
And then with our friends at TSMC and 3DFabric and 3DBlox you’re starting to see what we’ve always seen in 2D in the emergence of formalization and alignment between different participants in the ecosystem. So I think it’s vital and I think we’ve always done it. So I’m pretty confident there’s a lot of rich material for collaboration and we will continue to come up with collaborative solutions.
How are the EDA design flows and the associated IP evolving and where do customers want to see them go?
It’s evolved a lot. It was mentioned earlier that multi-die system is not new. We started working on it probably 12 years ago. But it’s only recently that the commercial significance and the complexity have grown, evolving from more of a hobbyist type of environment earlier to more of a professional environment. What we’re trying to do is to evolve it from the design methods of a few years ago, which basically revolved around assembly: you have components, and you assemble the components together. Now we’re getting into more of a multi-die system type of activity, going from an assembly problem to more of a design automation problem and trying to elevate it to where you’re now looking at designing the system together, because the chips are so co-dependent on each other. You can’t design the chiplets in isolation from each other because there’s a host of inter-related dependencies.
Principally where we are as an industry, we’ve invested decades of work into highly complex products and flows, and we don’t want to throw that away, right? You don’t want to disrupt that. You want to ride on top of that and augment it.
Where I see the EDA space going – we will continue to see a lot of the fine-grained optimizations that you would see in a traditional 2D problem space. Where I come from in Place and Route, you have a lot of very nice and almost convex problems that are fairly suitable to apply traditional techniques to solve them.
However, when you get to system level, these problems get kind of lumpy, and your solution space can become highly non-convex and difficult to solve with traditional techniques. That’s where, looking into the future, AI and ML and these kinds of approaches can really help drive it forward.
So design has evolved from manual implementation, to computer-aided design, to electronic design automation, to AI-driven design automation. And probably in the future, instead of computer-aided design, maybe it becomes human-aided design. The AI will tell me “Hey Henry, I need that spec tightened up by next week. I need you to get that to me.“ With the complexity, you really need the automation in order to reasonably build and optimize these systems.
Do you see multi-die system as a significant driver for this technology moving forward?
Yes. Take something like silicon lifecycle management, which is emerging for 2D – if it’s important for 2D, it’s even more so for 3D.
If you look at it from the standpoint of yield, normally you look at 2D dies and there’s the concept of known good dies. So you can test dies before you put them all in. But really if you look at a multi-die system, the system yield is the product of your yields, right? So even if you have all known good dies, you still have to put them together. And so there are multiplicative factors, and you can roughly translate that same type of analysis over into the overall health of the system as well, which depends on the multiplicative health of the components.
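Henry's multiplicative-yield point can be captured in a few lines (the yield values are illustrative numbers only):

```python
# System yield for a multi-die assembly is the product of the per-die
# yields times the assembly yield -- even all-known-good dies lose yield
# when bonded together.
def system_yield(die_yields, assembly_yield=0.99):
    y = assembly_yield
    for d in die_yields:
        y *= d
    return y

# Eight known-good dies at 99% each still give roughly 91% system yield:
print(round(system_yield([0.99] * 8), 3))
```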
You have heterogeneous dies with known different properties, different workloads, and different behaviors across your different dies. So it becomes all the more important to be able to keep on top of that in monitoring.
Memories have always played a critical role, both in pushing the envelope on the semiconductor process development front and supporting the varied requirements of different applications and use-cases. The list of the various types of memories in use today runs long. At a gross level, we can classify memories into volatile or non-volatile, read-only or read-write, static or dynamic, etc. And when it comes to the cost, performance, power and area/form factor of an electronic system, a lot rides on the use of the right memories for the application. The lion’s share of the attention is paid to the effective use of Static Random Access Memories (SRAMs) and Dynamic Random Access Memories (DRAMs), according to the tradeoff benefits to be derived. While the need for higher density memories that consume very low power and perform like SRAMs has always been there, applications were able to manage with a judicious mix of DRAMs and SRAMs.
But over recent years, fast-growing markets such as modem, edge connectivity and EdgeAI have started demanding more from memories. Additionally, with the rise of the Smart Internet of Things (IoT) and wearable technology, there is an increasing demand for memory solutions that can provide high performance and low power consumption to extend battery life. These applications want memories that deliver the performance and power benefits of SRAMs (over DRAMs) and the density and cost benefits of DRAMs (over SRAMs) rolled into one. Fortunately, such a type of memory was invented quite a while ago and is called the Pseudo Static Random Access Memory (PSRAM). PSRAM manufacturers were waiting in the wings for adoption drivers such as the above-mentioned fast-growing applications. The list of PSRAM memory suppliers includes AP Memory, Infineon, Micron Technology, Winbond Technology, and others.
What is PSRAM? [Source: JEDEC.org]
(1) A combinational form of a dynamic RAM that incorporates various refresh and control circuits on-chip (e.g., refresh address counter and multiplexer, interval timer, arbiter). These circuits allow the PSRAM operating characteristics to closely resemble those of a SRAM.
(2) A random-access memory whose internal structure is a dynamic memory with refresh control signals generated internally, in the standby mode, so that it can mimic the function of a static memory.
(3) PSRAMs have nonmultiplexed address lines and pinouts similar to SRAMs.
Mobiveil
Mobiveil is a fast-growing technology company that specializes in the development of Silicon Intellectual Properties, platforms and solutions for various fast-growing markets. Its strategy is to grow with burgeoning markets by offering its customers valuable IPs that are easy to integrate into SoCs. One such IP is Mobiveil’s PSRAM Controller, which has been in mass production for more than half a decade with customers across the US, Europe, Israel and China. The controller is available in different system bus flavors such as AXI and AHB and supports a variety of PSRAM and HyperRAM devices from many suppliers. The company recently expanded the list with the addition of support for AP Memory’s latest 250MHz PSRAM devices.
AP Memory
AP Memory is a world leader in PSRAM and has shipped more than six-billion PSRAM devices to date. The company has positioned itself as a market leader in PSRAM devices, providing a complete product line of high-quality memory solutions to support IoT and wearables market segments. The company continuously launches competitive products and provides customized memory solutions based on customer requirements.
Mobiveil-AP Memory Partnership
This partnership expects to bring significant benefits to SoCs, as PSRAM devices offer 10x higher density than eSRAM, 10x lower power compared to standard DRAM, and close to 3x fewer pins. These advantages translate into lower power consumption, higher performance, and cost savings for systems that leverage PSRAMs.
The result of the partnership is a controller IP that will provide cost-effective, ultra-low-power memory solutions for system designers. Mobiveil has adapted its PSRAM Controller to interface with AP Memory’s new PSRAM device that goes up to 250 MHz in speed and densities from 64Mb to 512Mb, supporting x8/x16 modes. This integration will allow SoC designers to take advantage of the high performance of the PSRAM controller at very low power, making it ideal for battery-operated applications, and extending the standby time of devices.
The PSRAM controller supports Octal Serial Peripheral Interface (Xccela standard), enabling speeds of up to 1,000 Mbytes/s for a 16-pin SPI option. Additionally, it provides support for a direct memory mapped system interface, automatic page boundary handling, linear/wrap/continuous/hybrid/burst support, and low power features like deep and half power down.
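The quoted 1,000 Mbytes/s figure is consistent with a 16-bit bus at 250 MHz if data moves on both clock edges (the double-data-rate factor is an assumption here, inferred from the numbers rather than stated in the announcement):

```python
# Sanity-checking the quoted 1,000 MB/s figure for the x16 option,
# assuming double data rate on the Xccela bus (an assumption here).
clock_mhz = 250
bus_width_bits = 16
ddr_factor = 2                      # data transferred on both clock edges

bandwidth_mb_s = clock_mhz * 1e6 * bus_width_bits * ddr_factor / 8 / 1e6
print(bandwidth_mb_s)               # 1000.0 MB/s
```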
Mobiveil’s flexible business models, strong industry presence through strategic alliances and key partnerships, dedicated integration support, and engineering development centers located in Milpitas, CA, Chennai, Bangalore, Hyderabad and Rajkot, India, and sales offices and representatives located worldwide, have added tremendous value to customers in executing their product goals within budget and on time. To learn more, visit www.mobiveil.com.
There were quite a few announcements at the TSMC Technical Symposium last week but the most important, in my opinion, were based on TSMC N3 tape-outs. Not only is N3 the leading 3nm process, it is the only one in mass production, which is why all of the top tier semiconductor companies are using it. TSMC N3 will be the most successful node in the history of the TSMC FinFET family, absolutely.
(Graphic: TSMC)
In order to tape out at 3nm you need IP, and high-speed SerDes IP is critical for HPC applications such as AI, which is now the big semiconductor driver for leading edge silicon. Enabling chiplets at 3nm is also a big deal, and that is the focus of this well-worded announcement:
Successful launch of 3nm connectivity silicon brings chiplet-enabled custom silicon platforms to the forefront Alphawave Semi 3nm Eye Diagram
(Graphic: Business Wire)
LONDON, United Kingdom, and TORONTO, Canada – April 25, 2023 – Alphawave Semi (LSE: AWE), a global leader in high-speed connectivity for the world’s technology infrastructure, today announced the bring-up of its first connectivity silicon platform on TSMC’s most advanced 3nm process with its ZeusCORE Extra-Long-Reach (XLR) 1-112Gbps NRZ/PAM4 serialiser-deserialiser (“SerDes”) IP.
An industry-first live demo of Alphawave Semi’s silicon platform with 112G Ethernet and PCIe 6.0 IP on TSMC 3nm process will be unveiled at the TSMC North America Symposium in Santa Clara, CA on April 26, 2023.
The 3nm process platform is crucial for the development of a new generation of advanced chips needed to cope with the exponential growth in AI-generated data, and enables higher performance, enhanced memory and I/O bandwidth, and reduced power consumption. ZeusCORE XLR Multi-Standard-Serdes (MSS) IP is the highest-performance SerDes in the Alphawave Semi product portfolio, and on the 3nm process it will pave the way for the development of future high-performance AI systems. It is a highly configurable IP that supports all leading-edge NRZ and PAM4 data center standards from 1 to 112 Gbps, supporting diverse protocols such as PCIe Gen1 to Gen6 and 1G/10G/25G/50G/100 Gbps Ethernet.
This flexible and customizable connectivity IP solution, together with Alphawave Semi’s chiplet-enabled custom silicon platform, which includes IO, memory and compute chiplets, allows end-users to produce high-performance silicon specifically tailored to their applications. Customers can benefit from Alphawave Semi’s application-optimized IP subsystems and advanced 2.5D/3D packaging expertise to integrate advanced interfaces such as Compute Express Link (CXL™), Universal Chiplet Interconnect Express™ (UCIe™), High Bandwidth Memory (HBMx), and Low-Power Double Data Rate DRAM (LPDDRx) onto custom chips and chiplets.
“Alphawave Semi continues to see growing demand from our hyperscaler customers for purpose-built silicon with very high-speed connectivity interfaces, fueled by an exponential increase in processing of AI-generated data”, said Mohit Gupta, SVP and GM, Custom Silicon and IP, Alphawave Semi. “We’re engaging our leading customers on chiplet-enabled 3nm custom silicon platforms which include IO, memory, and compute chiplets. Our Virtual Channel Aggregator (VCA) partnership with TSMC has provided invaluable support, and we look forward to accelerating our customers’ high-performance designs on TSMC’s 3nm process.”
About Alphawave Semi
Alphawave Semi is a global leader in high-speed connectivity for the world’s technology infrastructure. Faced with the exponential growth of data, Alphawave Semi’s technology services a critical need: enabling data to travel faster, more reliably and with higher performance at lower power. We are a vertically integrated semiconductor company, and our IP, custom silicon, and connectivity products are deployed by global tier-one customers in data centers, compute, networking, AI, 5G, autonomous vehicles, and storage. Founded in 2017 by an expert technical team with a proven track record in licensing semiconductor IP, our mission is to accelerate the critical data infrastructure at the heart of our digital world. To find out more about Alphawave Semi, visit: awavesemi.com.
I’ve been following Solido as a start-up EDA vendor since 2005, through their acquisition by Siemens in 2017. At the recent User2User event there was a presentation by Kwonchil Kang of Samsung Electronics on the topic, ML-enabled Statistical Circuit Verification Methodology using Solido. High-reliability circuits have a high-sigma requirement, and 6 sigma equates to 10 failures per 10,135,946,920 samples, or simulations. Using multiple Process, Voltage and Temperature (PVT) corners multiplies the simulation count even further. Using a brute-force approach to reach high sigma with Monte Carlo simulations simply takes too much time.
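That 6-sigma sample count follows directly from the one-sided Gaussian tail probability, and can be checked with a few lines of Python (this sanity check is mine, not from the presentation):

```python
import math

def tail_prob(sigma: float) -> float:
    """One-sided Gaussian tail probability P(Z > sigma)."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))

p6 = tail_prob(6.0)                   # ~9.866e-10 failure probability
samples_per_10_failures = 10.0 / p6   # ~10.1 billion samples
print(f"P(fail) at 6 sigma: {p6:.4e}")
print(f"samples per 10 failures: {samples_per_10_failures:,.0f}")
```

The result, roughly 10 failures per 10.1 billion samples, matches the figure quoted in the presentation.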
There is a reduced Monte Carlo approach that tries to scale to 6 sigma, but for a bandgap reference circuit example with 36 PVT corners it requires 3,000 simulations per PVT corner, or 108,000 simulations for all 36 corners, and its accuracy becomes limited when long-tail or non-Gaussian characteristics are introduced.
The Solido approach uses Artificial Intelligence (AI) for variation-aware design and verification with Solido Variation Designer, and there are two components:
PVTMC Verifier – finds worst-case corner for target sigma and design sensitivities to variation
High-Sigma Verifier – high-sigma verification 1,000X to 100,000,000X faster than brute-force simulation
There are several steps to the AI algorithm used in the Solido tools:
Generate Monte Carlo samples, but don't simulate them yet
Simulate an initial set of samples
Order all of the samples and simulate them in that order
Simulate samples around the target sigma
Simulating these additional samples around the target sigma then captures the true yield.
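The ordered-sampling idea can be sketched with a toy surrogate model. Everything below – the stand-in "simulator", the linear surrogate, and the sample counts – is an illustrative assumption, not Solido's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(x):
    """Stand-in for an expensive SPICE run (illustrative only)."""
    return 2.0 * x[0] - 1.0 * x[1] + 0.1 * np.sin(5 * x[0])

# Step 1: generate Monte Carlo samples, but don't simulate them yet.
samples = rng.standard_normal((100_000, 2))

# Step 2: simulate a small initial batch.
n_init = 50
y_init = np.array([simulate(s) for s in samples[:n_init]])

# Step 3: fit a cheap surrogate and order ALL samples by predicted output.
A = np.c_[samples[:n_init], np.ones(n_init)]
coef, *_ = np.linalg.lstsq(A, y_init, rcond=None)
pred = samples @ coef[:2] + coef[2]
order = np.argsort(pred)[::-1]          # predicted-worst first

# Step 4: simulate in that order; the true worst cases surface after
# only a few hundred real simulations instead of 100,000.
n_tail = 300
y_tail = np.array([simulate(samples[i]) for i in order[:n_tail]])
true_worst = max(simulate(s) for s in samples)
print(max(y_tail), true_worst)
```

In this toy setup the ordered 300-sample batch captures the same worst case that a brute-force pass over all 100,000 samples finds.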
With this Solido AI approach, the resulting Probability Density Function (PDF) would look like the example below:
Probability Density Function
The dashed blue line is the verified PDF fit. Green dots are the initial samples, dark dots are the Monte Carlo results, and orange dots are the ordered samples.
For the actual bandgap reference circuit described in the presentation, Solido Variation Designer achieved verification equivalent to 10 billion brute-force simulations in just 24,100 simulations, translating to a speed-up of some 415,000X.
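The quoted speed-up is just the ratio of the simulation counts:

```python
brute_force_sims = 10_000_000_000   # equivalent brute-force coverage
solido_sims = 24_100                # actual simulations run
speedup = brute_force_sims / solido_sims
print(f"speed-up: {speedup:,.0f}X")  # ~415,000X
```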
PVTMC Verifier covers all PVT corners and runs Monte Carlo in a way that requires only a few hundred simulations to capture the target sigma, thus reducing the number of simulations across the corners. The results are accurate because no extrapolations or Gaussian assumptions are used; real simulations are run at the target sigma, and all PVTs are covered in a single run of the tool.
PVTMC Verifier example results
Internally, PVTMC Verifier identifies ordinal classes for all PVTs, captures a distribution for each class, then verifies the distributions within the known classes. On the bandgap reference circuit described in the presentation, PVTMC Verifier ran a 6-sigma verification across all 36 PVTs in just 11,000 simulations, a speed-up of 32,000,000X compared to brute-force Monte Carlo.
The tool flow for using Solido AI is that a circuit netlist is run through PVTMC Verifier, which selects the worst-case statistical points, simulates the samples at multiple scales, observes the response to a change in scale, then builds a model to predict the unscaled yield estimate. These first-pass results are then sent to High-Sigma Verifier, which runs initial samples until model building succeeds, uses AI to generate Monte Carlo samples, then runs tail samples until the result is verified.
Using the Solido AI methodology required only 300 simulations per PVT with PVTMC Verifier (10,800 simulations) plus 24,100 simulations with High-Sigma Verifier, for a total of 34,900 simulations. The accuracy matched brute-force Monte Carlo; however, the results completed 10,000,000X faster.
Summary
At Samsung they are using Solido AI technology to achieve their goals of high-sigma verification across IC applications, while having much shorter run times than using brute-force Monte Carlo simulations. They used PVTMC Verifier to give first-pass results across all PVTs, then followed with High-Sigma Verifier for the final verification on critical and worst-case PVTs.
MOSFET gate resistance is a very important parameter, determining many characteristics of MOSFETs and CMOS circuits, such as:
• Switching speed
• RC delay
• Fmax – maximum frequency of oscillation
• Gate (thermal) noise
• Series resistance and quality factor in MOS capacitors and varactors
• Switching speed and uniformity in power FETs
• Many other device and circuit characteristics
Many academic and research papers have been written about gate resistance. However, for the practical work of IC designers and layout engineers, many important things have not been discussed or explained, for example:
• Is gate resistance handled by SPICE models or by parasitic extraction tools?
• How do parasitic extraction tools handle gate resistance?
• How can one evaluate gate resistance from the layout or from an extracted, post-layout netlist?
• How can one identify if gate resistance is limited by the “intrinsic” gate resistance (gate poly), or by gate metallization routing, and what are the most critical layers and polygons?
• Is gate distributed effect (factors of 1/3 and 1/12, for single- and double-contacted poly) captured in IC design flow (in PDK)?
• Is vertical gate resistance component captured in foundry PDKs?
• Should the gate be made wider or narrower, to reduce gate resistance?
• What’s the difference between handling gate resistance in PDKs for RF versus regular MOSFETs or p-cells?
The purpose of this article is to demystify these questions, and to provide some insights for IC design and layout engineers to better understand gate resistance in their designs.
Gate resistance definition and measurement
Gate resistance is an “effective” resistance from the driving point (gate port, or gate driver) to the MOSFET gate instance pin(s) – see Figure 1. (An instance pin is the connection point between a terminal of the SPICE model and the resistive network of a net.)
Figure 1. MOSFET cross-section and schematic illustration of gate resistance.
However, the simplicity of the schematic in Figure 1 can be very misleading. Gate nets can be very large, containing many driving points, dozens of metal and via layers, millions of polygons, and up to millions of gate instance pins (connection points for device SPICE model gate terminals) – see Figure 2.
Figure 2. Schematic illustration of the top-view and cross-sectional view of MOSFET gate network
The gate network forms a large distributed system with one or several driving points and many destination points.
Very often, the gate net looks and behaves like a huge, regular clock network, distributing the gate voltage across the FET.
Deriving an equivalent, effective gate resistance for such a large and complex system is not a simple and straightforward task, and SPICE circuit simulation does not explicitly report a gate resistance value.
Knowing the value of gate resistance is very useful for estimating switching speed, delay, noise, Fmax, and other characteristics, to check that they are within spec. Also, knowing the contributions to the gate resistance – by layer and by layout polygon – is very useful for guiding layout optimization efforts.
Gate resistance handling by parasitic extraction tools
To understand gate resistance in IC design flow, it’s important to know how parasitic extraction tools treat and model it.
All industry-standard parasitic extraction tools handle gate resistance and its extraction similarly. In the layout, the MOS gate structure is represented by a 2D mask traditionally called “poly” – even though the actual gate may be formed by a complex metal stack with a complex 3D structure.
They fracture the poly line at the intersection with the active (diffusion) layer, breaking it into “gate poly” (poly over active) and “field poly” (poly outside active), as shown in Figure 3.
Figure 3. R and RC extraction around MOSFET gate.
Gate poly is also fractured at its center point. The gate instance pin of the MOSFET (SPICE model) is connected to the center point of the gate poly, and the gate poly is described by two parasitic resistors connecting the fracture points. A more accurate model of the gate poly, with two positive and one negative resistor, can be enabled in the PDK, but some foundries prefer not to use it (see the section on the delta gate model below).
Parasitic resistors representing the field poly are connected to the gate contacts or to MEOL (Middle-End-Of-Line) layers and further to upper metal layers.
MOSFET extrinsic parasitic capacitance between the gate poly and the source/drain diffusions and contacts is calculated by parasitic extraction tools and assigned to the nodes of the resistive network. Different extraction tools do this differently – some connect these parasitic capacitances to the center point of the gate poly, while others connect them to the end points of the gate poly resistors. The details of how the parasitic capacitances connect to the gate resistor network can have a significant impact on transient and AC response, especially in advanced nodes (16nm and below), where gate parasitic resistance is huge.
These details can be seen in the DSPF file, but they are not usually discussed in the open literature or in foundry PDK documentation. Visual inspection of text DSPF files is tedious and requires some expertise. Specialized EDA tools (e.g., ParagonX [3]) can be used to visualize RC network connectivity in post-layout netlists (DSPF, SPEF), probe them (inspect R and C values), perform electrical analysis, and do other useful things.
Delta gate model
The MOSFET gate forms a large distributed RC network along the gate width, as shown in Figure 4.
This distributed network has a different AC and transient response than a simple lumped one-R, one-C circuit. It was shown [1-2] that such an RC network behaves approximately the same as a network with one R and one C element, where C is the total capacitance, R = (1/3)·(W/L)·Rsh for single-side-contacted poly, and R = (1/12)·(W/L)·Rsh for double-side-contacted poly. These coefficients – 1/3 and 1/12 – enable an accurate reduced-order model of the gate, reducing a large number of R and C elements to two (or three) resistors and one capacitor.
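As a quick numeric illustration of those coefficients (the width, length, and sheet resistance below are made-up example values, not from any PDK):

```python
def gate_poly_resistance(w_um: float, l_um: float, rsh_ohm_sq: float,
                         double_contacted: bool = False) -> float:
    """Effective lumped resistance of a distributed gate:
    R = (1/3)*(W/L)*Rsh for single-side-contacted poly,
    R = (1/12)*(W/L)*Rsh for double-side-contacted poly."""
    factor = 1.0 / 12.0 if double_contacted else 1.0 / 3.0
    return factor * (w_um / l_um) * rsh_ohm_sq

# Illustrative example: W = 1 um, L = 0.03 um, Rsh = 10 Ohm/sq
r_single = gate_poly_resistance(1.0, 0.03, 10.0)         # ~111 Ohm
r_double = gate_poly_resistance(1.0, 0.03, 10.0, True)   # ~27.8 Ohm
print(r_single, r_double)
```

Note that double-side contacting cuts the effective resistance by a factor of four, not two, because both the resistance path and the distributed averaging improve.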
To enable these coefficients in a standard RC netlist (SPICE netlist or DSPF), some smart folks invented a so-called Gate Delta Model – where a gate is described by two positive and one negative resistors – see Figure 5.
Figure 5. MOSFET Delta gate model.
Some SPICE simulators have problems handling negative resistors, which is possibly why this model has not gained wide adoption. Some foundries and PDKs support the delta gate model, while others don't.
Many people are surprised when they see negative resistors in DSPF files. If these resistors are next to the gate instance pin, they are part of the delta gate circuit.
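A quick scan for such negative resistors can be done with a short script. Note that this assumes simple one-line `Rname node1 node2 value` resistor cards; real DSPF files can use continuation lines and parameters, so treat this as a starting point rather than a full DSPF parser:

```python
def find_negative_resistors(dspf_text: str):
    """Return (name, node1, node2, value) for every negative resistor.

    Assumes simple one-line resistor cards: "Rxxx n1 n2 value".
    Continuation lines ("+") and parameters are not handled here.
    """
    hits = []
    for line in dspf_text.splitlines():
        tok = line.split()
        if len(tok) >= 4 and tok[0].upper().startswith("R"):
            try:
                val = float(tok[3])
            except ValueError:
                continue  # skip values with unit suffixes, parameters, etc.
            if val < 0:
                hits.append((tok[0], tok[1], tok[2], val))
    return hits

example = """\
R1 g_pin n1 12.5
R2 n1 n2 -4.1
R3 n2 n3 3.3
"""
print(find_negative_resistors(example))
```

If the negative resistors reported this way sit next to gate instance pins, they are the delta gate model at work rather than an extraction error.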
Distributed effects along the gate length (in the direction from source to drain) are usually ignored at the circuit analysis level, because the gate length is small compared to the gate width.
Impact of interconnect parasitics on gate resistance
In “old” technologies, metal interconnects (metals and vias) had very low resistance, and gate resistance was dominated by the gate poly, so its analysis and calculation were very simple.
In the latest technologies (e.g., 16nm and below), interconnects have very high resistance and can contribute a significant fraction (50% or more) of the total gate resistance. Depending on the layout, gate resistance may have significant contributions from any layer – device layers (gate poly, field poly), MEOL, or BEOL.
Figure 6 shows the results of a gate resistance simulation using ParagonX [3]. A Pareto chart of resistance contributions by layer helps identify the most important layers for gate resistance. Visualizing the contributions of individual layout polygons to the gate resistance immediately points to the choke points and bottlenecks, which is very useful for guiding layout optimization efforts.
Figure 6. Simulation results of gate resistance: (a) Gate resistance contribution by layer, and (b) contribution by polygons shown by color over the layout.
Gate resistance in FinFETs
In planar MOSFETs, the gate has a very simple planar structure, and the current flow in the gate is one-dimensional, along the direction of the gate width.
In FinFET technologies, the gate wraps around very tall silicon fins and hence has a very complicated 3D structure. Further, the gate materials are selected based on their work function, to tune the threshold voltage (in FinFETs, the threshold voltage is tuned not by channel doping but by the gate materials). These materials have very high resistance, much higher than silicided poly (which has a typical sheet resistance of ~10 Ohm/sq). The gate may be formed by multiple layers – an interface layer with the silicon, and one or more layers above it.
However, all these details are abstracted from the IC designers and layout engineers, and they see usual polygons for “poly” and for “active” – which makes design work much easier.
Handshake between SPICE model and parasitic extraction
In general, both SPICE models and parasitic extraction tools take gate resistance into account. Parasitic extraction is considered the more accurate method of calculating parasitic R and C values around the devices, since, unlike the SPICE model, it “knows” about the layout.
To avoid double-counting parasitic resistance and capacitance (in the SPICE model and in parasitic extraction), there is a hand-shake mechanism between SPICE modeling and parasitic extraction, based on special instance parameters.
Regular device vs RF Pcell compact models
Regular MOSFET SPICE models do not describe gate resistance accurately enough for high frequencies, high switching speeds, or RF and noise performance. To enable high simulation accuracy, foundries usually recommend using RF P-cells, which have fixed sizes, contain shielding (guard rings and metal cages), and are described by high-accuracy models derived from measurements. However, these RF P-cells have a much larger area than standard MOSFETs, so many designers prefer to use standard MOSFETs to reduce area.
Vertical component of gate resistance
In “old” technologies (pre-16nm), gate resistance was dominated by the lateral resistance. In advanced technologies, however, multiple interfaces between gate material layers lead to a large vertical gate resistance, which is inversely proportional to the area of the gate poly. It can be modeled as an additional resistor connecting the gate instance pin to the center point of the gate poly – see Figure 7(a). As a result, as the gate gets narrower (fewer fins), gate resistance first goes down, but then increases again at very small gate widths – a characteristic non-monotonic behavior, as seen in Figure 7(b). The old rule of thumb that a narrower gate has lower gate resistance no longer works. Designers and layout engineers have to select an optimum (non-minimal) gate width (number of fins) to minimize gate resistance.
Figure 7. (a) Gate model accounting for vertical gate resistance, and (b) measured and simulated gate resistance versus number of fins (ref. [2]).
Depending on the technology, PDK, and foundry, the vertical gate resistance may or may not be included in parasitic extraction. It is easy to check in the DSPF file: if the gate instance pin is connected directly to the center of the gate poly, vertical resistance is not accounted for; if it is connected by a positive resistor to the center of the gate poly, that resistor represents the vertical gate resistance.
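This non-monotonic behavior can be reproduced with a toy model in which the lateral component grows with fin count while the vertical component shrinks with gate area. The coefficients below are arbitrary illustrative values, not foundry data:

```python
def gate_resistance_model(n_fins: int, r_lat_per_fin: float = 1.0,
                          r_vert_scale: float = 16.0) -> float:
    """Toy model: lateral R grows ~linearly with gate width (fin count),
    while vertical R is inversely proportional to gate area."""
    return r_lat_per_fin * n_fins + r_vert_scale / n_fins

best = min(range(1, 21), key=gate_resistance_model)
print(best, gate_resistance_model(best))  # minimum at 4 fins, R = 8.0
```

With these coefficients, neither the narrowest nor the widest gate is optimal: the sweet spot sits where the lateral and vertical components balance, just as Figure 7(b) shows.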
Technology trends
With technology scaling, both gate and interconnect resistances increase significantly – by up to one or two orders of magnitude. As a result, layout details that were unimportant for gate resistance in older nodes become very important in advanced nodes.
Other MOSFET gate-like structures
While the discussion in this article has focused on MOSFET gate resistance, the same arguments and approaches apply to other distributed systems controlled by a gate or gate-like electrode, such as:
• IGBTs (Insulated Gate Bipolar Transistors)
• Decoupling capacitors
• MOS capacitors
• Varactors
• Deep trench and other MIM-like integrated capacitors
Figure 8 shows a gate structure of a vertical MOSFET, and gate delay distribution over the device area, simulated using ParagonX [3].
Figure 8. (a) Typical layout of vertical FET, IGBT, and other gate-controlled devices. (b) Distribution of gate resistance and delay over area.
References
1. B. Razavi, et al., “Impact of distributed gate resistance on the performance of MOS devices,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 41, pp. 750-754, Nov. 1994.
2. A. J. Scholten, et al., “FinFET compact modelling for analogue and RF applications,” IEDM 2010, p. 190.
3. ParagonX User Guide, Diakopto Inc., 2023.