SemiWiki – Page 190 – The Open Forum for Semiconductor Professionals

June 25, 2022 / June 28, 2022

The Evolution of Taiwan’s Silicon Shield
by Craig Addison on 06-25-2022 at 6:00 am
Categories: China

The original Silicon Shield theory, as described in my 2001 book, stated that Taiwan’s role as producer of 90 per cent of the world’s IT products (at that time) protected it from an attack by China because the United States, acting in its own self-interest, would come to the island’s defense. A similar scenario – involving oil, not electronics – occurred in 1990 when the US intervened after Iraq invaded Kuwait.

Fast forward a decade after the book, and much of Taiwan’s electronics production, including laptops and mobile phones, had moved to China – although it was still controlled by Taiwanese-owned companies like Compal, Foxconn and Quanta. (The transfer of Taiwanese chip technology to China was restricted, and still is).

The 2009 Silicon Shield documentary reflected this shift by arguing that China would refrain from attacking Taiwan because of the harm it would inflict upon itself. In other words, a Cold War-style mutually assured destruction (MAD) scenario would keep the peace.

So what is the Silicon Shield today?

Both of the above still apply in their own way, but some pundits now believe the Silicon Shield may even increase the risk of Taiwan being forcibly taken by China. The “Broken Nest” theory states that Taiwan should adopt a scorched earth policy and destroy TSMC et al in the event of an attack, thus reducing the island’s value to the invaders.

While the Broken Nest has its fair share of critics, a similar scenario was foreshadowed by one of the people interviewed for the 2009 documentary. Chih-Yuan Lu, former head of Taiwan’s Submicron Project and since then president of Macronix International, said Taiwan’s semiconductor industry could be compared to jade, the precious mineral valued by the Chinese.

“If you have valuable jade in your pocket and you cannot defend yourself, there are many robbers who will target you,” Lu said at the time. In the case of two parties fighting over ownership, “at the last moment they even want to break the jade” to prevent the other from having it, he explained.

Since the outbreak of the pandemic, semiconductors have been elevated from relative obscurity to an industry of keen interest to mainstream media and the general public. The same goes for Taiwan and its role in the hi-tech supply chain. These developments motivated me to revive the original Silicon Shield documentary for a new audience.

The result is “Silicon Shield 2025” – the year being a reference to the date Taiwan’s defense minister believes China will have the ability to invade. The new version, available for streaming on Vimeo On Demand, uses the same voice-over narration and video interviews from the 2009 production, but the content has been digitally remastered and updated with HD b-roll footage as well as new material to reflect recent events. Indeed, it is remarkable how much of the original documentary narrative from 13 years ago is relevant today, perhaps more so.

SemiWiki members choosing the “rent” option on Vimeo On Demand can watch “Silicon Shield 2025” free of charge by using the promo code CHIPS, which is valid until July 25.

In addition, be sure to check out The Chip Warriors podcast – the most recent episode being on Taiwan’s Chip Warriors, featuring the above-mentioned C.Y. Lu, as well as legends like TSMC founder Morris Chang.

For those interested in how Taiwan got into this situation in the first place – caught between two superpowers – check out the Nixon’s China Choice podcast. Nothing about semiconductors here, but it is a fascinating look into the minds of Nixon, Kissinger and Haldeman as they sought rapprochement with Communist China while trying not to sacrifice Taiwan in the process. Nixon failed in the latter, but that setback – along with the loss of US diplomatic recognition under Carter in 1979 – provided the impetus for Taiwan’s leaders to take the enormous risk of betting their national survival on semiconductors.

Also read:

US Supply Chain Data Request Elicits a Range of Responses, from Tight-Lipped to Uptight

Losing Lithography: How the US Invented, then lost, a Critical Chipmaking Process

Why Tech Tales are Wafer Thin in Hollywood

Podcast EP90: A Tour of Cadence’s Cloud Solutions with Mahesh Turaga
by Daniel Nenni on 06-24-2022 at 10:00 am

Dan is joined by Mahesh Turaga, VP of Cloud Business Development at Cadence Design Systems. Mahesh brings extensive customer-facing experience to Cadence in business development, strategy, pre-sales, and consulting. He provides an overview of the cloud solutions provided by Cadence. The various business models, technical details and target customer profiles are all discussed.

Mahesh holds an MBA from Northwestern University – Kellogg School of Management and a Ph.D. in aeronautics, structural mechanics, composites and fluid dynamics from Purdue University.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.

June 24, 2022 / June 24, 2022

ASML EUV Update at SPIE
by Scotten Jones on 06-24-2022 at 6:00 am
Categories: Events, Lithography, Semiconductor Services, TechInsights

At the 2022 SPIE Advanced Lithography Conference, ASML presented an update on EUV. I recently had a chance to go over the presentations with Mike Lercel of ASML. The following is a summary of our discussions.

0.33 NA

The 0.33 NA EUV systems are the production workhorse systems for leading edge lithography today. 0.33 NA systems are in high volume production for both logic and DRAM. Figure 1 illustrates the number of EUV layers for logic and DRAM (bars) and wafers exposed per year (area). Author’s note: the 2021 values for logic are typical of foundry 5nm processes at 10+ EUV layers, and 2023 logic would be in line with foundry 3nm processes at ~20 layers. DRAM usage is currently ~5 layers. I asked Mike about future DRAM exposures; he pointed out that there are ~8 critical layers on a DRAM and eventually some of those layers could need multi-patterning, bringing EUV exposures up to 10 per wafer.

Figure 1. EUV Adoption.

Through Q1 of 2022 ASML has shipped 136 EUV systems and ~70 million wafers have been exposed, see figure 2.

Figure 2. Number of EUV Wafers Exposed.

System availability continues to improve and is a little less than 90% today. The new NXE:3600D is better than the NXE:3400C and provides ~93% availability. EUV system availability is getting close to DUV system levels (~95%).

Figure 3. Availability.

NXE:3600D systems can produce 160 wafers per hour (wph) at 30mJ/cm², 18% better than the NXE:3400C. The NXE:3800E systems in development will provide >195 wph at 30mJ/cm² initially, and 220 wph with throughput upgrades. The NXE:3600E will have incremental optical improvements in aberration, overlay and throughput.

Figure 4. Throughput Improvements.
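
The throughput figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: the helper names are made up for this example, the ">195 wph" figure is treated as a lower bound, and the wafers-per-day estimate assumes a tool runs at its rated throughput whenever it is available, which real fabs will not achieve.

```python
# Figures quoted in the article; the helper names are illustrative only.
NXE3600D_WPH = 160.0           # wafers per hour at 30 mJ/cm^2
GAIN_OVER_3400C = 0.18         # "18% better than the NXE:3400C"
NXE3800E_INITIAL_WPH = 195.0   # quoted as ">195", used here as a lower bound
NXE3800E_UPGRADED_WPH = 220.0  # with throughput upgrades
NXE3600D_AVAILABILITY = 0.93   # "~93% availability"


def relative_gain(new_wph: float, old_wph: float) -> float:
    """Fractional throughput improvement of new_wph over old_wph."""
    return new_wph / old_wph - 1.0


def implied_baseline(new_wph: float, gain: float) -> float:
    """Throughput of the older tool implied by a quoted fractional gain."""
    return new_wph / (1.0 + gain)


def wafers_per_day(wph: float, availability: float) -> float:
    """Upper-bound estimate: rated wph for every available hour of the day."""
    return wph * 24.0 * availability


if __name__ == "__main__":
    print("Implied NXE:3400C wph:", round(implied_baseline(NXE3600D_WPH, GAIN_OVER_3400C), 1))
    print("3800E initial gain over 3600D:", relative_gain(NXE3800E_INITIAL_WPH, NXE3600D_WPH))
    print("3800E upgraded gain over 3600D:", relative_gain(NXE3800E_UPGRADED_WPH, NXE3600D_WPH))
    print("3600D wafers/day (upper bound):", wafers_per_day(NXE3600D_WPH, NXE3600D_AVAILABILITY))
```

On these numbers the NXE:3400C works out to roughly 136 wph, and the NXE:3800E would be at least about 22% faster than the NXE:3600D initially and about 38% faster with upgrades.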

Matched machine overlay for the NXE:3400C was 1.5nm and is 1.1nm for the NXE:3600D. The NXE:3600D uses the same new 12 wavelength alignment system as the newest DUV systems with just a few material differences due to vacuum use.

The ASML roadmap includes the NXE:4000F with >220wph around 2025, see figure 5.

Figure 5. System Roadmap.

Pellicles now achieve greater than 90% transmission and manufacturing has been transferred to Mitsui. I run into people from time to time who think pellicles are a future EUV item, but pellicles have been in production use on select layers for over a year.

Figure 6. Pellicle Performance.

Finally for the 0.33 NA system ASML is working on reducing the energy required for each exposure by increasing throughput and decreasing total energy.

Figure 7. Energy Per Exposure.

We discussed the ultimate resolution limits for 0.33 NA systems. In theory, 0.33 NA can produce 26nm in a single exposure; Imec is currently working on 28nm single exposure, but it isn’t in production yet.

0.55 NA (High NA)

As described in the previous section 0.33 NA EUV is in high volume production. Leading edge foundry processes have now reached the 3nm “node” and double patterning with 0.33 NA EUV is becoming necessary. By raising the NA from 0.33 to 0.55 double patterned layers can be replaced with single exposures.

Figure 8 illustrates how DUV layer counts grew, driven by process complexity and multipatterning, until 0.33 NA EUV took over, eliminating a lot of multipatterning. As 0.33 NA EUV multipatterning use grows, 0.55 NA EUV can eliminate some multipatterning, reducing layer counts again.

Figure 8. Mask Count Trends.

High-NA provides a better image log slope. Stochastic defects are 3D, and high-NA helps with defect reduction. ASML is working on attenuated phase shift masks for EUV to improve contrast and depth of field. They will be implemented for 0.33 NA first and then for 0.55 NA later.

ASML’s roadmap has the first High NA system (EXE:5000) being installed in a lab at the ASML factory run jointly with Imec in 2023 for initial evaluation. EXE:5000 systems should be delivered to customers in 2024 and the production EXE:5200 system should be delivered to customers for production use around 2025, see figure 9.

Figure 9. High-NA System Roadmap.

The optics for High-NA are significantly larger than for 0.33 NA and require a unique design approach. 0.55 NA systems will have an anamorphic lens system with a 4x reduction ratio in one direction (the same as 0.33 NA) and an 8x reduction ratio in the orthogonal direction. Due to the size of the reticle and the 8x reduction, the printable field size is cut in half to 16.5mm in the scan direction, see figure 10.

Figure 10. Anamorphic Lens System.
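
The half-field result follows directly from dividing the reticle pattern area by the reduction ratio in each direction. The sketch below assumes the standard 104 mm x 132 mm pattern area on the reticle (which is what gives today's 26 mm x 33 mm full field at 4x); that dimension and the function name are assumptions for illustration, not figures from the ASML presentation.

```python
# Assumed reticle pattern area in mm (gives a 26 x 33 mm field at 4x/4x).
RETICLE_X_MM = 104.0
RETICLE_SCAN_MM = 132.0


def field_size_mm(reduction_x: float, reduction_scan: float) -> tuple[float, float]:
    """Printable wafer field (x, scan) for the given reduction ratios."""
    return RETICLE_X_MM / reduction_x, RETICLE_SCAN_MM / reduction_scan


if __name__ == "__main__":
    print("0.33 NA (4x / 4x):", field_size_mm(4, 4))
    print("0.55 NA (4x / 8x):", field_size_mm(4, 8))
```

With the same reticle, the anamorphic 4x/8x optics keep the 26 mm field width but halve the scan-direction length from 33 mm to 16.5 mm.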

Simulations show no direction differences between a half-field and full field exposure. Half-field exposures can be aligned to full field exposures so that existing DUV and 0.33 NA EUV systems can be used in a mix and match strategy with 0.55 NA systems. If necessary for large die, 0.55 NA half-field exposures can be stitched together, possibly with a small stitch boundary for global connections.

Using research tools at The Center for X-Ray Optics at Berkeley and the Paul Scherrer Institut, ASML has been able to demonstrate High-NA EUV resolution down to 8nm, see figure 11.

Figure 11. 8nm Line/Spaces.
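
As a rough sanity check on the resolution figures, the standard Rayleigh scaling relation ties half-pitch to wavelength and NA. The worked numbers below are a sketch under two assumptions that the article does not state: that the 26nm quoted for 0.33 NA is a pitch (so a 13nm half-pitch), and that the 8nm line/space result corresponds to an 8nm half-pitch. The wavelength is the 13.5nm EUV wavelength.

```latex
% Rayleigh scaling: half-pitch HP as a function of wavelength and NA
\mathrm{HP} = k_1 \,\frac{\lambda}{\mathrm{NA}}
\quad\Longrightarrow\quad
k_1 = \frac{\mathrm{HP}\cdot\mathrm{NA}}{\lambda}

% 0.33 NA, assuming 26 nm is a pitch (HP = 13 nm)
k_1 = \frac{13 \times 0.33}{13.5} \approx 0.32

% 0.55 NA, assuming 8 nm lines/spaces (HP = 8 nm)
k_1 = \frac{8 \times 0.55}{13.5} \approx 0.33
```

Under those assumptions both results sit at a similar process factor of about 0.32, so the improvement from 13nm to 8nm half-pitch is close to what the NA ratio alone (0.33/0.55 = 0.6) would predict.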

The 0.55 NA system design is broken up into 4 independently testable sub systems (see figure 12) and assembly of the first exposure tool to go into the ASML/Imec lab in 2023 has begun (see figure 13).

Figure 12. High-NA Sub Systems.

Figure 13. 0.55 NA System Integration.

ASML continues to work on increasing source power and has recently demonstrated >500 watts in research. Historically it has taken ~2 years for research developments to reach production. Figure 14 illustrates source power over time.

Figure 14. Source Power Trends.

0.7 NA

In a recent article, Tom Dillinger discussed an interview with Mark Phillips of Intel, in which Mark mentioned 0.7 NA as a successor to 0.55 NA. I was surprised by this; I thought ASML had ruled out developing anything after 0.55 NA due to the high investments ASML has had to make on EUV. Mike said ASML hasn’t ruled out a 0.7 or greater NA system; they are looking at it. He said they have ruled out shorter wavelengths than the current 13.5nm (author’s note: at one time there was some discussion of a shorter wavelength 6.x nm system). They do want any new system to be air shippable, which limits how much bigger the system can be than the 0.55 NA systems.

Conclusion

0.33 NA EUV systems are now production workhorse systems with continuously improving availability and throughput. 0.55 NA systems are expected to enter production in 2025 with higher resolution enabling process simplification. Beyond 0.55 NA, ASML is looking at even higher NA systems. EUV is well positioned to continue to drive lithography resolution for the next decade.

Also Read:

Obscuration-Induced Pitch Incompatibilities in High-NA EUV Lithography

The Electron Spread Function in EUV Lithography

Double Diffraction in EUV Masks: Seeing Through The Illusion of Symmetry

June 23, 2022 / July 18, 2025

Using STA with Aging Analysis for Robust IC Designs
by Daniel Payne on 06-23-2022 at 10:00 am
Categories: EDA, Synopsys

Our laptops and desktop computers have billions of transistors in their application processor chips, yet I often don’t consider the reliability effects of aging that the transistors experience in the chips. At the recent Synopsys User Group (aka SNUG), there was a technical presentation on this topic from Srinivas Bodapati, an engineer at Intel.

Device Aging

As transistors are switched on and off, their drain currents can slowly decrease over time. This in turn changes path delays, which makes the chip slow down and can even cause it to fail to meet specifications. Device aging is now a first order problem when designing leading edge processor chips and GPUs. To manage power dissipation, many SoC designs employ Dynamic Voltage Frequency Scaling (DVFS) techniques, yet the stress from running with a high VDD begins to impact circuit operation when in low VDD mode.

Gate Level Aging: Specification Failure

Device aging is dependent on workload, voltage, temperature and frequency, and the two effects that cause transistor performance to shift with age are:

Bias Temperature Instability (BTI)
Hot Carrier Injection (HCI)

The device aging mechanisms for HCI and BTI are summarized in this table as a function of each factor:

Device Aging Mechanisms

At 14nm the main aging contribution was from BTI, but at 10nm it was from HCI effects. At the same time the End Of Life (EOL) drive currents increased by 1.65X, going from 14nm to 10nm.

Use Condition Problems

With DVFS circuits during the high VDD frequency mode there is stress to the transistors, which then impacts the circuit operation in low VDD mode. The delay of gates can become slower through aging, even to the point of getting out of specification, causing a timing failure.

During Static Timing Analysis (STA), the challenge is to model the workload dependency of aging, and to consider that input slope plus output load impact aging. Consider an SoC example where there is a high performance core (P-core), an efficiency core (E-core), fabric, and system IP blocks. These four types of IP have very different supply voltage ranges, and also temperatures. Trying to use the same static guard band for each IP block would be overly pessimistic for some scenarios, so using an existing aged library cannot really capture all of the various stress scenarios.

Aging for different circuits

STA Aging Model Complexity

In the example below there’s a launch path and a capture path, but each path has a unique switching activity, so the two paths degrade by different amounts as they age. For each path the effects of both BTI and HCI also need to be taken into account, as aging degradation depends on each.

Launch path, Capture path

Old and New Approaches

The older approach was to use STA with Aged Libraries and then have path simulation for derates. The drawbacks of the older approach are that DVFS usage is not accounted for, the BTI vs HCI effects are not separated, and it requires handcrafted paths. The other challenge is the productivity bottleneck, as the STA and simulation are typically handled by different expert teams. Modeling aging involves multiple cycles of identifying paths, running simulations, and analyzing results to come up with derates, which can then be used for modeling aging. However, these derates can often be pessimistic.

The new approach is an aging-aware STA methodology, which has automated workload dependency, simulates actual paths, takes into account BTI and HCI tradeoffs, works within a single simulation structure, and scales across aging mission profiles without trading off accuracy, enabling designers to find the actual worst case.

Aging-Aware STA Flow

The Synopsys tool for this aging-aware flow is called PrimeShield, and there are two components:

Aging STA
Aging-aware SPICE simulation

Intel used the aging-aware SPICE simulation component, where the circuit designer specifies a set of paths for simulation in Simlink. This enables the designer to specify and create stress conditions and simulate with HSPICE, creating a degradation file that is used to generate a playback with fresh conditions to measure the aging impact. Aging-aware Simlink makes stress conditions easier to create and automates evaluating the impact of aging at various other stress conditions, based on initial inputs.

Aging-Aware STA Flow

On the other hand, the aging-aware STA flow eases the methodology further by using aged libraries with mission profile information to calculate the impact of aging on the actual paths using Synopsys PrimeTime’s PBA methodology. It also enables designers to configure the stress waveform by setting the cycle count, an activity factor, a signal probability, age time, and stress voltage ratio.

Results

Using the aging-aware flow they wanted to see the workload dependency of slack degradation, and the reference case is called slack2, where both the launch clock and capture clock have an activity factor of 0.2, shown in the table below:

Workload dependency of slack degradation

Slack2 is the reference scenario, with equal activity factors for launch and capture clocks. The other three scenarios have a variety of activity factors for launch and capture clocks, and the yellow table shows how the slack degradation increases for each scenario, with scenario slack82 having the worst case slack degradation. These results depend on the effects of HCI and BTI.

Running and plotting many paths to compare normalized degraded slack versus normalized reference slack is shown in the next plot. The legend shows four types of results:

Launch clock at 0.2, capture clock at 0.2 (l2c2)
Launch clock at 0.8, capture clock at 0.8 (l8c8)
Launch clock at 0.8, capture clock at 0.2 (l8c2)
Launch clock at 0.2, capture clock at 0.8 (l2c8)

Normalized results

This helps designers identify worst case corners for each IP block in a path aging flow.
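
Why the high-launch, low-capture combination comes out worst can be seen with a toy setup-slack model. The sketch below is not the PrimeShield or Simlink flow and uses no Synopsys API; the linear aging coefficient, the path delays and all names are hypothetical, chosen only to show the direction of the effect. It assumes delay grows in proportion to activity factor, and that the data path ages with the launch clock's activity.

```python
from dataclasses import dataclass

# Hypothetical coefficient: fractional delay increase at activity factor 1.0.
AGING_COEFF = 0.10


@dataclass(frozen=True)
class TimingPath:
    launch_clock_ps: float   # clock network delay to the launching flop
    data_ps: float           # combinational delay between the flops
    capture_clock_ps: float  # clock network delay to the capturing flop
    setup_ps: float          # setup time of the capturing flop
    period_ps: float         # clock period


def aged_delay(delay_ps: float, activity: float) -> float:
    """Toy linear aging model: delay grows with activity factor."""
    return delay_ps * (1.0 + AGING_COEFF * activity)


def setup_slack(path: TimingPath, launch_af: float = 0.0, capture_af: float = 0.0) -> float:
    """Setup slack = required time - arrival time, with aged delays."""
    arrival = aged_delay(path.launch_clock_ps + path.data_ps, launch_af)
    required = path.period_ps + aged_delay(path.capture_clock_ps, capture_af) - path.setup_ps
    return required - arrival


# Made-up path and the four launch/capture activity-factor scenarios.
PATH = TimingPath(200.0, 600.0, 200.0, 30.0, 1000.0)
SCENARIOS = {"l2c2": (0.2, 0.2), "l8c8": (0.8, 0.8), "l8c2": (0.8, 0.2), "l2c8": (0.2, 0.8)}


def aged_slacks(path: TimingPath = PATH) -> dict[str, float]:
    """Aged setup slack for every scenario."""
    return {name: setup_slack(path, laf, caf) for name, (laf, caf) in SCENARIOS.items()}


def worst_scenario(path: TimingPath = PATH) -> str:
    """Name of the scenario with the smallest aged slack."""
    slacks = aged_slacks(path)
    return min(slacks, key=slacks.get)


if __name__ == "__main__":
    for name, slack in aged_slacks().items():
        print(name, round(slack, 1))
    print("worst:", worst_scenario())
```

In this toy model l8c2 is the worst case for setup: a heavily aged launch side makes data arrive later, while a lightly aged capture clock does not arrive correspondingly later to compensate. That matches the direction of the slack82 result reported above, although the real degradation depends on the BTI and HCI behavior of the actual cells.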

Conclusions

Running STA with aging effects is quite complex, especially when using DVFS design techniques, because aging depends on workload and accurate answers require modeling it. Intel designers working with Synopsys tools and AEs have developed an aging-aware STA flow that uses PrimeShield, Simlink and HSPICE together for path simulations. Reliability issues are now first order, so having automation for aging analysis in an STA flow is a must-have feature.

June 23, 2022 / July 21, 2022

Scaling Safety Analysis. Reusability for FMEDA
by Bernard Murphy on 06-23-2022 at 6:00 am
Categories: Arteris, Automotive, IP

It is common when a new type of analysis is introduced in almost any domain that it works well enough for a while. Until it begins to struggle with growing problem size, prompting refinements to the methodology to allow continued scaling. We see this routinely in analytics for SoC design, so it should not be a big surprise that safety analysis, in the form of failure modes, effects and diagnostic analysis (FMEDA), is starting to look a little creaky. Which is no small concern. FMEDAs are the contracts passed up from IP developers to SoC integrators, providing assurance that safety weaknesses have been fully analyzed and mitigated. This is not a requirement we want to short-change because the analysis problem becomes too messy.

Configurability – the root cause

Most IPs are configurable, even in-house IPs, because to be useful as reusable components, they must be able to adapt to a variety of different SoC applications. There is no IP more configurable than a network-on-chip (NoC). The whole structure of the NoC will change depending on how many components it must connect, what quality of service goals it must meet, how it should adapt to minimize congestion and so on.

All this configurability is essential to meet SoC design goals, but it comes with a downside. FMEDA is a flat characterization based on fault simulation, run on the component as configured. There is generally no way to analyze a parametrized IP before configuration. SoC integrators must run the analysis per IP, even on commercial components. IP suppliers will provide as much help as they can in the form of templates and advice, but the burden of final and lengthy FMEDA remains with the integrator. The SoC team must repeat this analysis if the configuration changes, all adding up to a lot of extra work.

The core problem is a lack of reusability in FMEDA. If this could be restructured to support reuse, then IP suppliers could provide a means to generate not only a configured IP but also an FMEDA for that IP. Integrators could avoid most of the effort in repeating flat analyses in this case.

Redesigning FMEDA for reuse

FMEDAs, as they stand, are not parametrizable, but they could be generated through a combination of low-level safety models and a compiler which could read those models together with the configured IP RTL to determine how root causes will propagate to effects. Arteris IP has written a thought leadership paper on putting these ideas into practice, cutting out much of the unnecessary rework in rebuilding FMEDAs. Conceptually this makes sense to me. Failure modes don’t change on configuration. There may be more or less of a certain type in some cases, added or subtracted in predictable ways. How these can propagate to effects also won’t change much, except in ways you could perhaps analyze through interpolation between a few carefully selected configurations, allowing a tool to compute the influence on the likelihood of failure. The concept seems very reasonable.
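
To make the idea concrete, here is a minimal sketch of what generating an FMEDA from per-element safety models plus a configuration might look like. It is not the Arteris IP method: the element names, failure modes, FIT rates and diagnostic coverage values are all invented for illustration, and the metric is a simplified single-point fault metric in the style of ISO 26262 that treats every fault as safety-related and ignores latent faults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    fit: float  # failure rate per instance, in FIT (hypothetical value)
    dc: float   # diagnostic coverage of the safety mechanism, 0..1 (hypothetical)


# Hypothetical low-level safety models, one list of failure modes per element type.
SAFETY_MODEL = {
    "switch": [
        FailureMode("packet_corruption", 2.0, 0.99),
        FailureMode("misrouting", 1.0, 0.90),
    ],
    "link": [FailureMode("bit_flip", 0.5, 0.99)],
}


def generate_fmeda(config: dict[str, int]) -> list[tuple[str, str, float, float]]:
    """Build FMEDA rows (element, mode, total FIT, residual FIT) for a configuration.

    The failure modes stay the same; only the instance counts change.
    """
    rows = []
    for element, count in config.items():
        for mode in SAFETY_MODEL[element]:
            total_fit = mode.fit * count
            rows.append((element, mode.name, total_fit, total_fit * (1.0 - mode.dc)))
    return rows


def spfm(rows: list[tuple[str, str, float, float]]) -> float:
    """Simplified single-point fault metric: 1 - residual FIT / total FIT."""
    total = sum(row[2] for row in rows)
    residual = sum(row[3] for row in rows)
    return 1.0 - residual / total


if __name__ == "__main__":
    small_noc = generate_fmeda({"switch": 4, "link": 10})
    large_noc = generate_fmeda({"switch": 16, "link": 24})
    print("small NoC SPFM:", spfm(small_noc))
    print("large NoC SPFM:", spfm(large_noc))
```

The point of the sketch is the structure: changing the configuration dictionary regenerates the rows and the metric without repeating a flat analysis. A real generator would also have to read the configured RTL to work out how each root cause propagates to effects.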

You could extend automated generation not only to generating IP FMEDAs but also to generating the SoC FMEDA. Apparently, leading semis in the automotive space already do something like this internally. The SoC generator must aggregate FMEDAs from the IPs, applying in-context requirements and assumptions of use to abstract failure modes to those relevant to system behavior. Adding this functionality with IP FMEDA generation could take a lot of the pain out of safety analysis for SoC integrators.

You can learn more about this topic in this Arteris IP presentation HERE.

Also Read:

Why Traceability Now? Blame Custom SoC Demand

Assembly Automation. Repair or Replace?

Experimenting for Better Floorplans

Podcast EP89: An Overview of NXP’s MCX MCU Products with CK Phua
by Daniel Nenni on 06-22-2022 at 10:00 am

Dan is joined by CK Phua of NXP. CK joined Philips Semiconductors in 1993 and worked in various roles including quality, applications engineering, product engineering and technical marketing. After Philips, CK joined Freescale in 2012 and rejoined NXP through the Freescale merger. CK is now a Product Manager for Microcontrollers in the Edge Processing Business Line.

CK provides a detailed overview of NXP’s MCX product line and its product families, including architecture and capabilities across a broad range of applications. The supporting development environment is also discussed, as well as security capabilities.

June 22, 2022 / July 2, 2022

TSMC 2022 Technology Symposium Review – Process Technology Development
by Tom Dillinger on 06-22-2022 at 5:00 am
Categories: Events, Foundries, TSMC

TSMC recently held their annual Technology Symposium in Santa Clara, CA. The presentations provided a comprehensive overview of their status and upcoming roadmap, covering all facets of process technology and advanced packaging development. This article will summarize the highlights of the process technology updates – a subsequent article will cover the advanced packaging area.

First, here is a brief overview of some of the general observations and broader industry trends, as reported by C.C. Wei, TSMC CEO.

General

“This year marks TSMC’s 35th anniversary. In 1987, we had 258 employees in one location, and released 28 products across 3 technologies. Ten years later, we had 5,600 employees, and released 915 products across 20 technologies. This year in 2022, we have 63,000 employees, and will release 12,000 products across 300 technologies.”
“From 2018 to 2022, the volume of 12” (equivalent) wafers has had an annual CAGR exceeding 70%. In particular, we are seeing a significant increase in the number of ‘big die’ products.” (>500mm²)
“In 2021, TSMC’s North America business segment shipped more than 7M wafers and over 5,500 products. There were 700 new product tapeouts (NTOs). This segment represents 65% of TSMC’s revenue.”
“Our gigafab expansion plans have typically involved adding two new ‘phases’ each year – that was the case from 2017-2019. In 2020, we opened six new phases, including our advanced packaging fab. In 2021, there were seven new phases, including fabs in Taiwan and overseas – advanced packaging capacity was added, as well. In 2022, there will be 5 new phases, both in Taiwan and overseas.”
- N2 fabrication: Fab20 in Hsinchu
- N3: Fab 18 in Tainan
- N7 and N28: Fab22 in Kaohsiung
- N28: Fab16 in Nanjing China
- N16, N28, and specialty technologies: Fab23 in Kumamoto Japan (in 2024)
- N5 in Arizona (in 2024)
“TSMC has 55% of the worldwide installed base of EUV lithography systems.”
“We are expanding our capital equipment investment significantly in 2022.” (The table below highlights the considerable jump in cap equipment planned expenditures.)

“We are experiencing stress in the manufacturing capacity of mature process nodes. In 35 years, we have never increased the capacity of a mature node after a subsequent node has ramped to high volume manufacturing – that is changing.”
“We continue to invest heavily in “intelligent manufacturing”, focusing on precision process control, tool productivity, and quality. Each gigafab handles 10M dispatch orders per day, and optimizes tool productivity. Each gigafab generates 70B data points daily to actively monitor.”

For the first time at the Symposium, a special “Innovation Zone” on the exhibit floor was allocated. The recent product offerings from a number of start-up companies were highlighted. TSMC indicated, “We have increased our support investment to assist small companies adopt our technologies. There is a dedicated team that focuses on start-ups. Support for smaller customers has always been a focus. Perhaps somewhere in this area will be the next Nvidia.”

Process Technology Review

With a couple of exceptions discussed further on, the process technology roadmap presentations were somewhat routine – that’s not a bad thing, but rather an indication of ongoing successful execution of prior roadmaps.

The roadmap updates were presented twice, once as part of the technology agenda, and again as part of TSMC’s focus on platform solutions. Recall that TSMC has specifically identified four “platforms” that individually receive development investment to optimize the process technology offerings: mobile; high-performance computing (HPC); automotive; and IoT (ultra-low power). The summaries below merge the two presentations.

N7/N6

over 400 NTOs by year-end 2022, primarily in the smartphone and CPU markets
N6 offers transparent migration from N7, enabling IP re-use
N6RF will be the RF solution for upcoming WiFi7 products
there is an N7HPC variant (not shown in the figure above), providing ~10% performance improvement at overdrive VDD levels

For N6, logic cell-based blocks can be re-implemented in a new library for additional performance improvements, achieving a major logic density improvement (~18%).

N5/N4

in the 3rd year of production, with over 2M wafers shipped, 150 NTOs by year-end 2022
mobile customers were the first, followed by HPC products
roadmap includes ongoing N4 process enhancements
N4P foundation IP is ready, interface IP available in 3Q2022 (to the v1.0 PDK)
there is an N5HPC variant (not shown in the figure above, ~8% perf improvement, HVM in 2H22)

As with the N7/N6, N4 provides “design re-use” compatibility with N5 hard IP, with a cell-based block re-implementation option.

The complexity of SoC designs for the automotive segment is accelerating. There will be an N5A process variant for the automotive platform, qualified to AEC-Q100 Grade 1 environmental and reliability targets (target date: 2H22). The N5A automotive process qualification involves both modeling and analysis updates (e.g., device aging models, thermal-aware electromigration analysis).

N3 and N3E

N3 will be in HVM starting in the second half of 2022
N3E process variant in HVM one year later; TSMC is expecting broad adoption across mobile and HPC platforms
N3E is ready for design start (v0.9 PDK), with high yield on the standard 256Mb memory array qualification testsite
N3E adds the “FinFLEX” methodology option, with three different cell libraries optimized for different PPA requirements (more at the end of this article)

Note that N3 and N3E are somewhat of an anomaly to the prior TSMC process roadmap. N3E will not offer a transparent migration of IP from N3. The N3E offering is a bit of a “correction”, in that significant design rule changes to N3 were adopted to improve yield.

TSMC’s early-adopter customers push for process PPA updates on an aggressive timeline, whether an incremental, compatible variant to an existing baseline (e.g., N7 to N6, N5 to N4), or for a new node. The original N3 process definition has a good pipeline of NTOs, but N3E will be the foundation for future variants.

N2

based on a nanosheet technology, target production date: 2025
compared to N3E, N2 will offer ~10-15% performance improvement (@iso-power, 0.75V) or ~25-30% power reduction (@iso-perf, 0.75V); note also the specified operating range in the figure above down to 0.55V
N2 will offer support for a backside power distribution network

Parenthetically, TSMC is faced with the dilemma that the requirements of the different platforms have such a broad range of targets for power, performance, and area/cost. As was noted above, N3E is addressing these targets with different libraries, incorporating a different number of fins that define the cell height. For N2 library design, this design decision is replaced by a process technology decision on the number of vertically-stacked nanosheets throughout (with some allowed variation in the device nanosheet width). It will be interesting to see what TSMC chooses to offer for N2 to cover the mobile and HPC markets, in terms of the nanosheet topology. (The image below from an earlier TSMC technical presentation at the VLSI 2022 Conference depicts 3 nanosheets.)

NB: There are two emerging process technologies being pursued to reduce power delivery impedance and improve local routability – i.e., “buried” power rail (BPR) and “backside” power distribution (BSPDN). The initial investigations into offering BPR have quickly expanded to process roadmaps that integrate full BSPDN, like N2. Yet, it is easy to get the two acronyms confused.

Specialty Technologies

TSMC defines the following offerings into a class denoted as “specialty technologies”:

ultra-low power/ultra-low leakage (utilizing an ultra-high Vt device variant)
- requires specific focus on ultra-low leakage SRAM bitcell design
- N12e in production, N6e in development (focus on very low VDD model support)

(embedded) non-volatile memory
- usually integrated with a microcontroller (MCU), typically in a ULP/ULL process
- RRAM
  - requires 2 additional masks, embedded in BEOL (much lower cost than the 12 masks for eFlash)
  - 10K write cycles (endurance specification), ~10 years retention @125C
- MRAM
  - 22MRAM in production, focus is on improving endurance
  - 16MRAM for Automotive Grade 1 applications in 2023

power management ICs (PMIC)
- based on bipolar-CMOS-DMOS (BCD) devices: 40BCD+, 22BCD+
- for complex 48V/12V power domains
- requires extremely low device R_on

high voltage applications (e.g., display drivers, using N80HV or N55HV)

analog/mixed-signal applications, requiring unique active and passive structures (e.g., precision thin-film resistors and low noise devices, using N22ULL and N16FFC)

MEMS (used in motion sensors, pressure sensors)

CMOS image sensors (CIS)
- pixel size of 1.75um in N65, 0.5um in N28, transitioning to N12FFC

radio frequency (RF), spanning from mmWave to longer wavelength wireless communication; the upcoming WiFi7 standard was highlighted

“The transition from WiFi6 to WiFi7 will require a significant increase in area and power, to support the increased bandwidth requirements – e.g., 2.2X area and 2.1X power. TSMC is qualifying the N6RF offering, with a ~30-40% power reduction compared to N16RF. This will allow customers currently using N16RF to roughly maintain existing power/area targets, when developing WiFi7 designs.”

The charts below illustrate how these specialty technologies are a fundamental part of platform products – e.g., smartphones and automotive products. The characteristic process nodes used for these applications are also shown.

Although the focus of smartphone development tends to be on the main application processor, the chart below highlights the extremely diverse requirements for specialty technology offerings, and their related features. In the automotive area, the transition to a “zonal control” architecture will require a new set of automotive ICs.

N3E and FinFLEX

The FinFLEX methodology announcement was emphasized, with TSMC indicating “FinFLEX will offer full-node scaling from N5.”

As FinFET technology nodes have scaled – i.e., from N16 to N10 to N7 to N5 – the fin profile and drive current per micron have improved significantly. Standard cell library design has evolved to incorporate fewer pFET and nFET fins that define the cell height (specified in terms of the number of horizontal metal routing tracks). As illustrated above, the N5 library used a 2-2 fin definition – that is, 2 pFET fins and 2 nFET fins to define the cell height. (N16/N12 used a 3-3 configuration.)

The library definition for N3E was faced with a couple of issues. Mobile and HPC platform applications are increasingly divergent, in terms of their PPA (and cost) goals. Mobile products focus on circuit density to integrate more functionality and/or reduced power, with less demanding performance improvements. HPC is much more focused on maximizing performance.

As a result, N3E will offer three libraries, as depicted in the figure above:

- an ultra low power library (cell height based on a 1-fin library)
- an efficient library (cell height based on a 2-fin library)
- a performance library (cell height based on a 3-fin library)

The figure below is from TSMC’s FinFLEX web site, illustrating the concept.

Now, offering multiple libraries for integration on a single SoC is not new. For years, processor companies have developed unique “datapath” and “control logic” library offerings, with different targets for: cell heights, circuit performance, routability (i.e., max cell area utilization), and distinct logic offerings (e.g., wide AND-OR gates for datapath multiplexing). Yet, the physical implementation of SoC designs using multiple libraries relied upon a consistent library per design block.

The unique nature of the FinFLEX methodology is that multiple libraries and multiple track heights will be intermixed within a block.

After the TSMC Symposium, additional information became available. A block design will alternate rows for the two libraries. For example, a 3:2 block design will have alternate row heights accommodating cells from the 3-fin and 2-fin library designs. A 2:1 block design will have alternate rows for cells from the 2-fin and 1-fin libraries.

TSMC indicated, “Different cell heights (in separate rows) are enabled in one block to optimize PPA. FinFLEX in N3E incorporates new design rules, new layout techniques, and significant changes to EDA implementation flows.”

There will certainly be more information to come about FinFLEX and the changes to the general design flow. Off-hand, there will need to be new approaches to:

- physical synthesis
  - how will synthesis improve timing on a critical signal
  - will synthesis strive to provide a netlist with a balanced ratio of cells from the two libraries for the alternating rows

For example, to improve timing on a highly-loaded signal, synthesis would typically update a cell assignment in the library to the next higher drive strength – e.g., NAND2_1X to NAND2_2X.

With FinFLEX, additional options are available with the second library – e.g., whether an update to NAND2_1X_2fin would use NAND2_2X_2fin or NAND2_1X_3fin. Yet, if the latter is chosen, the new cell will need to be “re-balanced” to a different row in the block floorplan. The effective changes in performance and input/output wire loading for these choices are potentially quite complex to estimate during physical synthesis.
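To make the two up-sizing choices concrete, here is a minimal sketch that enumerates them for a cell in a 3:2 block. The cell naming pattern follows the example names above; the drive-strength ladder and the function name `upsize_options` are assumptions for illustration, not part of any TSMC or EDA vendor flow.

```python
import re

DRIVES = ["1X", "2X", "4X"]           # hypothetical drive-strength ladder
LIBRARIES = ["1fin", "2fin", "3fin"]  # fin counts of the three N3E libraries

def upsize_options(cell):
    """List (candidate_cell, needs_row_move) pairs for a name like 'NAND2_1X_2fin'."""
    match = re.fullmatch(r"(\w+?)_(\d+X)_(\dfin)", cell)
    if match is None:
        raise ValueError(f"unrecognized cell name: {cell}")
    func, drive, lib = match.groups()
    options = []
    d = DRIVES.index(drive)
    if d + 1 < len(DRIVES):
        # Same library, next drive strength: the cell can stay in its row.
        options.append((f"{func}_{DRIVES[d + 1]}_{lib}", False))
    i = LIBRARIES.index(lib)
    if i + 1 < len(LIBRARIES):
        # Same drive strength, taller library: the cell must move to another row.
        options.append((f"{func}_{drive}_{LIBRARIES[i + 1]}", True))
    return options

for name, needs_row_move in upsize_options("NAND2_1X_2fin"):
    print(name, "(row re-balance)" if needs_row_move else "(same row)")
```

A real flow would also have to score each candidate for delay, input loading, and the wire-length change caused by the row move, which is exactly the estimation problem described above.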

The cell selection options get even more intricate when considering specific flop cells to use, given not only the differences in clock-to-Q delays, but also the setup and hold time characteristics, and input clock loading. When would it be better for individual flop bits in a register to use different output drive strengths in the same library (and be placed locally) versus having register bits re-balanced to a row corresponding to a different library selection?

With an alternating row configuration, the assumption is that there will be an even mix of cells from the two libraries. Yet, the synthesis of a block may only require a small percentage of “high-performance” cells to meet timing objectives. An output netlist without a balanced mix of library cells may have low overall utilization, suggesting a uniform row, single-library block floorplan may be suitable instead. This may result in iterations in the chip floorplan (and likely, revisions in the power distribution network, as well).
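The utilization concern can be illustrated with a small back-of-the-envelope model. This is a sketch under stated assumptions: rows of the two libraries alternate one-for-one and have equal width, routing whitespace is ignored, and the row heights used in the example (3 and 2 units) are hypothetical placeholders rather than N3E values.

```python
def alternating_row_utilization(area_a, area_b, height_a, height_b):
    """Utilization of a block whose rows alternate between two libraries.

    area_a, area_b: total cell area drawn from each library (same units).
    height_a, height_b: row height for each library (same length units).
    Rows come in pairs, so the block needs enough pairs to hold whichever
    library demands more total row length.
    """
    row_length_a = area_a / height_a   # row length needed by library A cells
    row_length_b = area_b / height_b   # row length needed by library B cells
    paired_length = max(row_length_a, row_length_b)
    block_area = paired_length * (height_a + height_b)
    return (area_a + area_b) / block_area

# Balanced netlist: both row types fill up together.
print(f"{alternating_row_utilization(60, 40, 3, 2):.0%}")
# Only 10% of the cell area comes from the taller library: its rows sit mostly empty.
print(f"{alternating_row_utilization(10, 90, 3, 2):.0%}")
```

In this toy model the unbalanced mix leaves more than half of the block area unused, which is the situation where a uniform, single-library floorplan starts to look attractive.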

- sub-block level IP integration

Blocks often contain a number of small hard IP macros, such as register files (typically provided by a register file generator). With non-uniform row heights, the algorithms in the generator become more complex, to align the power continuity between the macro circuits and the cell rows. And, there will be placement restriction rules that will need to be added to the hard IP models.

- timing/power optimizations during physical design

Similarly to the physical synthesis block construction options, there will be difficult decisions on cell selection during the timing and power optimization steps in the physical design flow. For example, if a cell can reduce its assigned drive strength to save power while still meeting timing, would a change in library selection, and thus row re-balancing, be considered? Would the corresponding changes in the cell placement negate the optimization?

and, last but most certainly not least,

- Will there be new EDA license costs to enable N3E FinFLEX?

(Years ago, the CAD department manager at a previous employer of mine went ballistic over the license cost adder to enable placement and routing for multipatterning requirements. Given the significant EDA investment required to support FinFLEX, history may repeat itself with additional license feature costs.)

The FinFLEX methodology definitely offers some intriguing options. It will be extremely interesting to see how this approach evolves.

Analog design migration automation

Lastly, TSMC briefly highlighted work they are pursuing to help designers migrate analog/mixed-signal circuits and layouts to newer process nodes.

Specifically, TSMC has defined a set of “analog cells”, with the capability to take an existing schematic, re-map to a new node, evaluate circuit optimizations, and migrate layouts, including auto-placement and (PG + signal) routing.

The definition of the analog cell libraries for N5/N4 and N3E is complete, with N7/N6 support to follow. TSMC showed an example of an operational transconductance amplifier (OTA) that had been through the migration flow.

Look for more details to follow. (This initiative appears to overlap with comparable features available from EDA vendor custom physical design platforms.)

A subsequent article will cover TSMC’s advanced packaging announcements at the 2022 Technology Symposium.

-chipguy

Also read:

Three Key Takeaways from the 2022 TSMC Technical Symposium!

Inverse Lithography Technology – A Status Update from TSMC

TSMC N3 will be a Record Setting Node!

June 21, 2022June 22, 2022

Qualcomm’s AI play
by Anand Joshi on 06-21-2022 at 10:00 am
Categories: AI

Qualcomm is a familiar name for chips in the mobile industry. The company generated $33 billion in revenue in 2021 and continues to march ahead with its innovations. However, Qualcomm doesn’t get the same visibility and mention as Nvidia and Intel in the world of AI chips. By our estimate, Qualcomm’s contribution to the AI chip market is comparable to that of Intel and Nvidia, given the volume shipment of smartphones and the silicon content dedicated to AI in recent years. Qualcomm has been steadily making progress in key AI chip markets and has perhaps the most diverse and comprehensive portfolio to cater to all AI chip markets.

Figure: Segments within the AI chip market and the products in each.

The AI chip market has grown significantly in the past few years, and you can read all about it in JP Data’s latest report on AI chips. According to the analysis, the overall AI chip market is best segmented by power consumption: data center AI chips (50+W), mid power AI chips (5-50W, primarily for automotive and similar markets), low power AI chips (0.1-5W, primarily for mobile and client computing) and ultra-low power AI chips (<0.1W, for always-on applications). There’s no sign of a slowdown in AI yet, with enterprises as well as edge device makers eager to test out new solutions. Many use cases and exciting applications continue to emerge. Proof-of-concept applications that are going into production are driving the need for AI inference chips.
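The power-based segmentation can be written down as a simple lookup. This is only an illustrative sketch: the function name is made up, and the handling of the exact boundary values (0.1W, 5W, 50W) is an assumption, since the ranges quoted above overlap at their endpoints.

```python
def ai_chip_segment(power_watts):
    """Map a chip's power consumption to one of the four segments above."""
    if power_watts < 0:
        raise ValueError("power must be non-negative")
    if power_watts < 0.1:
        return "ultra-low power"   # always-on applications
    if power_watts < 5:
        return "low power"         # mobile and client computing
    if power_watts < 50:
        return "mid power"         # automotive and similar markets
    return "data center"           # 50W and above

for watts in (0.05, 2, 20, 75):
    print(f"{watts} W -> {ai_chip_segment(watts)}")
```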

Qualcomm is poised to play in all of these markets, which sets it apart from other companies. For the data center market, the company has introduced the AI100 chip, and its results submitted to MLPerf compete well with Nvidia’s. Qualcomm boasts significantly higher performance per watt than the competition. Qualcomm is actively adapting its Snapdragon product line to support the automotive market and recently claimed design wins at BMW. Qualcomm’s dominance in the low power segment of the mobile world is well known and needs no introduction. The same chips offer an ultra-low power mode for always-on applications, enabling a whole new set of AI use cases for device manufacturers.

This makes its portfolio even more comprehensive than those of Nvidia and Intel, if we set the training aspect aside. Nvidia, for example, doesn’t have products in the mobile space, and neither does Intel. Intel and Nvidia don’t have solutions for the ultra-low power market either.

Qualcomm was somewhat late to the party and focused earlier on accelerating AI by enhancing its Hexagon DSP and Adreno GPU. The company then acquired Nuvia to create a new AI accelerator. At Microsoft’s 2022 Build conference, the company announced Project Volterra, a new device powered by Snapdragon chips that contain an AI accelerator (NPU). The dedicated accelerator will become part of Microsoft’s Windows 11. Via the included SDK to build AI applications, the chip will enable AI usage within a large number of Windows applications to potentially challenge x86 dominance in the PC world.

Qualcomm has invested heavily in AI since. It announced a $100 million AI fund back in 2018, has aggressively invested in AI R&D, and has released an SDK that allows developers to take a model and customize it for mobile, automotive, IoT, robotics or other markets. While there is no data on active AI developers for Qualcomm, we expect the number to be much lower than the developer bases Nvidia and Intel can boast of. In fact, Google Trends reveals that searches for Qualcomm AI are far below those for Nvidia AI or Intel AI, suggesting that there’s a lot of catching up to do.

The AI chip market is still emerging. Nvidia has become the de facto standard in training, but the inference market is just starting its ramp-up. If Qualcomm is indeed able to offer a consistent software experience across different market segments, it has the potential to become a formidable player in the AI chip market.

A Fresh Look at HLS Value
by Bernard Murphy on 06-21-2022 at 6:00 am
Categories: 5G, AI, EDA, Siemens EDA

I’ve written several articles on High-Level Synthesis (HLS), designing in C, C++ or SystemC, then synthesizing to RTL. There is unquestionable appeal to the concept. A higher level of abstraction enables a function to be described in fewer lines of code (LOC). Which immediately offers higher productivity and implies fewer bugs because the number of bugs in any kind of code scales pretty reliably with LOC. Simulation for architectural design and validation runs multiple orders of magnitude faster, allowing for broader experimentation with options. It also can run much larger tests like image recognition on streaming video, a tough goal for RTL simulations. Yet these methods have largely been restricted to specialized design objectives, it seemed. Signal processing functions, some simple ML inference engines, that sort of thing.

I’m always willing to be re-educated, especially when I can hear from customers. Siemens EDA just hosted a webinar, mostly customer talks on use of HLS with just a little marketing thrown in. Pretty much a full day of presentations, centering around a few core applications, which made me rethink my position. The algorithm classes the technology best serves haven’t changed so much. What has changed is that big market needs have shifted to overlap more with those algorithms. Check out which companies presented on these topics. Naturally, when these speakers talked about HLS, they meant Catapult from Siemens EDA.

Video Codecs

There’s been a massive worldwide increase in cloud video workload. According to Google, video now accounts for more than 80% of internet traffic, thanks to streaming and YouTube in particular. Aki Kuusela of Google said that this volume demands warehouse scale encoding with fast throughput. From his perspective the whole warehouse must be viewed as a system – storage, networking, codec, compute, etc. – to optimize for this level of traffic and throughput. Moreover, codecs must support a variety of video formats seamlessly, from the latest formats to popular standards to legacy standards. Think of YouTube; every minute 500 hours of new content are uploaded, and tens of thousands of live streams must be served simultaneously.

Off-the-shelf solutions can’t meet this need. For the same reason Google built their own ML training platforms (TPUs), they are building their own codecs, which must be optimized across traffic diversity, quality, throughput, and availability that only they can reproduce. Google started early with HLS to integrate with the YouTube stack. Nvidia is doing very similar work, also on video codecs. The world leader in GPUs for gaming, graphics, and AI needs to have the fastest and highest quality video. Of course they are building their own codecs.

Object detection for the Mars sample return program

Another cool video example (but not a codec) is from NASA/JPL. This from the team that brought you Ingenuity, the Mars helicopter. Now they are designing something called a Harris corner detector, an image-related algorithm, as a part of development for the Mars sample return project. The original implementation was in RTL as a DSP-like function, but this proved difficult to optimize. The speaker describes approaches using SystemC, implementing a DSP process or a Kahn (essentially self-timed) process, using the flexibility HLS offers for experimenting with these options.
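For readers who haven’t met it, the Harris detector scores each pixel by how strongly the image changes in two directions at once. The sketch below is the textbook formulation in NumPy, not the JPL implementation (which the talk summary doesn’t detail); the 3x3 box window, the constant k = 0.05, and the test image are all assumptions chosen for brevity.

```python
import numpy as np

def box_sum(a, r=1):
    """Sum each pixel's (2r+1) x (2r+1) neighborhood, with zero padding."""
    padded = np.pad(a, r)
    out = np.zeros_like(a, dtype=float)
    n = 2 * r + 1
    for dy in range(n):
        for dx in range(n):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(image, k=0.05):
    """Per-pixel Harris response R = det(M) - k * trace(M)^2."""
    img = np.asarray(image, dtype=float)
    iy, ix = np.gradient(img)   # derivatives along rows (y) and columns (x)
    sxx = box_sum(ix * ix)      # structure tensor entries, summed over
    syy = box_sum(iy * iy)      # a 3x3 window around each pixel
    sxy = box_sum(ix * iy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace

# A bright square on a dark background: the response peaks at a corner.
img = np.zeros((12, 12))
img[4:8, 4:8] = 1.0
response = harris_response(img)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))
print("strongest response at (row, col):", peak)
```

The nested window sums and per-pixel multiplies are the kind of regular, data-parallel arithmetic that maps naturally onto either a DSP-style or a dataflow-style hardware process.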

OK, so video applications like these are still in that same algorithmic niche I was talking about earlier. But the business relevance of the video processing niche has exploded. Carrying HLS along with it.

Wireless applications

NXP, as a leader in automotive electronics, is working on a complete baseband for ultra-wideband (UWB). The technology you will soon be using for ultra-secure keyless entry to your car (your current Bluetooth-enabled keyless entry is not so secure). At some point maybe also contactless payment for the same reason. They found their traditional approach to designing the baseband, starting from Simulink, was too slow to converge. Much of the functionality here is signal processing; think filters and equalizers in multiple channels for example. Such a design demands high levels of parallelism at high clock rates which is difficult to architect in a timing-unaware platform. The application must also be very low power; think of UWB in a car key fob, running off a coin cell battery. These designs must build on custom-crafted signal processing.

A new company, Viosoft, is building a complete RAN physical layer for 5G (the radio unit piece of the network), from rate matching/channel mapping to model, time/frequency synchronization, MIMO/beamforming to RF processing and more. This must handle multiple bandwidth and latency requirements and multiple transmission frequencies. Once more lots of signal processing with huge demand for flexibility. The application will be built on an FPGA but still must be power optimized because it will be sitting in a potentially remote location.

Wireless, lots of signal processing, and low power demand. Once again requiring custom design solutions, built through HLS.

Smart sensing and wireless power transfer

ST provided a fascinating 3-part pitch. The first section was on infrared sensing for people detection in a room using a smart sensor. This technology can be useful for energy-saving controls. Sensing is on a grid within a room, allowing for machine learning of patterns of movement, thus a neural network which is where they use HLS.

The next application was a Qi (wireless power transmission) demodulator, a modem-like (and therefore DSP-like) function extracting power rather than information from the signal. The third application was a contactless infrared sensor, something familiar to all of us now thanks to COVID. A prior implementation did the temperature calculations in an embedded processor. This work pushes the calculation into the smart sensor, first establishing a correction for ambient temperature and for the sensed object temperature, then using the Stefan-Boltzmann law (yay physics!) to compute the temperature of the object. Note these are simply formulae, not DSP or ML operations, but they do use floating point math for precision, so the HLS approach was an easy choice.
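As a reminder of the physics involved, here is a minimal sketch of the Stefan-Boltzmann step. It assumes the simplest radiative exchange model, P = εσA(T_obj⁴ − T_amb⁴), solved for T_obj; ST’s actual correction formulae were not given, and the emissivity, area, and temperatures below are placeholder values.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)

def net_radiated_power_w(object_k, ambient_k, area_m2, emissivity=0.98):
    """Net power exchanged between an object and its surroundings."""
    return emissivity * SIGMA * area_m2 * (object_k ** 4 - ambient_k ** 4)

def object_temperature_k(net_power_w, ambient_k, area_m2, emissivity=0.98):
    """Invert the relation above to recover the object temperature."""
    t4 = net_power_w / (emissivity * SIGMA * area_m2) + ambient_k ** 4
    return t4 ** 0.25

# Round trip: a 37 C object seen from a 22 C room over 1 cm^2 (assumed values).
power = net_radiated_power_w(310.15, 295.15, 1e-4)
recovered = object_temperature_k(power, 295.15, 1e-4)
print(f"net power {power * 1e3:.2f} mW -> {recovered - 273.15:.2f} C")
```

The fourth powers and the fourth root are why floating point is convenient here: the intermediate values span many orders of magnitude.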

What I like here is the applicability of HLS to these consumer-oriented applications, where cost and power will both be critical.

Wrap up

I skipped a couple of talks, one from Nvidia research on modeling interconnect in SystemC to get some feel for latencies as a function of layout. Another was from Siemens EDA on MatchLib, the open-source library originally developed by Nvidia in support of this modeling. All good stuff but not directly relevant to my theme here of the compelling demand for HLS in multiple applications.

Bottom line, best fit algorithms still tend to be signal processing centric, but big markets now see huge value in custom hardware development around those algorithms. You can watch the entire set of talks HERE.

Also read:

HLS in a Stanford Edge ML Accelerator Design

Standardization of Chiplet Models for Heterogeneous Integration

Using EM/IR Analysis for Efinix FPGAs

June 20, 2022July 19, 2022

How to Cut Costs of Conversational AI by up to 90%
by Dave Bursky on 06-20-2022 at 10:00 am
Categories: Achronix, AI, eFPGA, FPGA

The burgeoning use of conversational artificial intelligence (CAI) in consumer and business applications places a heavy computational burden on both front-end and back-end systems that provide the natural language processing (NLP). NLP systems rely on deep learning (a subset of machine learning) to automate speech recognition, perform the NLP functions, and then provide the text to speech output. To cut the cost of NLP systems, Achronix and Myrtle.ai have partnered, promising to cut costs by up to 90% while also reducing hardware requirements, as described in this whitepaper.

Myrtle.ai, a technology specialist in FPGA AI inferencing, implements performant recurrent neural network (RNN)-based models on FPGAs using its MAU inferencing acceleration engine. The MAU engine, integrated into the Achronix Speedster®7t AC7t1500 FPGA, leverages key architectural aspects of the Speedster7t architecture to drastically increase the acceleration of real-time automatic speech recognition (ASR) neural networks. That translates into a 2500% increase in the number of real-time streams that can be processed when compared to a server-class CPU.

The CAI pipeline is often defined by three key functional blocks:

- Speech to text (STT), also known as automatic speech recognition (ASR)
- Natural language processing (NLP)
- Text to speech (TTS) or speech synthesis

Such pipelines are found in millions of virtual voice assistants such as Apple’s Siri or Amazon’s Alexa, or voice search assistants on laptops such as Microsoft’s Cortana, as well as automated call center (or contact center) agents and many other applications. The deep learning algorithms that power these CAI services are either processed on the local electronic device or aggregated in the cloud for remote processing at scale. Large-scale deployments supporting millions of consumer interactions represent extremely large compute challenges, which hyperscaler providers have addressed by developing specialized silicon devices to process these services.

State-of-the-art ASR algorithms are implemented with end-to-end deep learning. Recurrent neural networks (RNNs), unlike convolutional neural networks (CNNs), are common in speech recognition. As noted in “CNN vs. RNN: How are they different?” by David Petersson of TechTarget, RNNs are better suited for processing temporal data, aligning well with ASR applications. RNN-based models require high compute capabilities and high memory bandwidths to process the neural network model within the strict latency targets required for conversational systems. When real-time or automated responses are too slow, the system appears sluggish and unnatural. Often low latency is achieved only at the expense of processing efficiency, which pushes up costs that can become too large for practical deployment.

Competing FPGA architectures in the ML acceleration segment claim teraoperations/second (TOPS) rates for inferencing as high as 150 TOPS. Yet in real-world applications, especially those which are latency sensitive such as ASR, these FPGAs fall well short of their headline TOPS rates due to their inability to efficiently transfer data between the compute and external memory. The Achronix Speedster7t architecture strikes the right balance of compute engines, eight high-speed memory interfaces (4 Tbit/s GDDR6 memory interfaces) and high-throughput data transfers (20 Tbit/s network on chip), yielding a device that can deliver 64% of the headline TOPS rates for real-time, low-latency ASR workloads (see the figure).

At the heart of the Speedster7t architecture are the 2560 machine-learning processor (MLP) blocks. These blocks contain an optimized matrix/vector multiplication function capable of 32 multiplies and one accumulate in a single clock cycle. This is the foundation for the compute engine architecture. Block RAM (BRAM) is co-located with each of the 2560 instances of the MLPs in the AC7t1500, which equates to lower latency and higher throughput. Myrtle.ai’s MAU low latency, high throughput ML inferencing engine has been integrated into the Achronix Speedster7t FPGA, leveraging 2000 of the 2560 MLPs. Because the MLP is a hard block, it can run at a much higher clock rate than if implemented in the FPGA fabric itself.
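The MLP counts translate into a per-cycle multiply budget with simple arithmetic. The sketch below uses only the figures quoted above; no clock rate is given in the text, so it deliberately stops short of a TOPS number.

```python
MLP_BLOCKS = 2560          # MLP blocks in the AC7t1500 (from the text)
MLPS_USED_BY_MAU = 2000    # MLPs used by the MAU engine (from the text)
MULTIPLIES_PER_MLP = 32    # multiplies per MLP per clock cycle (from the text)

def multiplies_per_cycle(mlp_count):
    """Peak multiplies per clock cycle across a given number of MLP blocks."""
    return mlp_count * MULTIPLIES_PER_MLP

device_peak = multiplies_per_cycle(MLP_BLOCKS)
mau_peak = multiplies_per_cycle(MLPS_USED_BY_MAU)
mlp_share = MLPS_USED_BY_MAU / MLP_BLOCKS

print(f"device: {device_peak:,} multiplies/cycle")
print(f"MAU:    {mau_peak:,} multiplies/cycle ({mlp_share:.0%} of the MLPs)")
```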

Most ASR solutions offered by large-scale cloud service providers such as Google, Amazon, Microsoft Azure, and Oracle allow service providers to build products on top of these cloud APIs. However, the service providers face increasingly large bills as their operations scale out, and those products achieve success in the market.

The publicly advertised costs of the larger ASR providers range from $0.01 to $0.025 per minute, and industry reports suggest that the average call center call lasts approximately five minutes. Consider a large enterprise data or call center services company fielding 50,000 calls per day at five minutes per call. At the stated rates above, the cost of the ASR processing would range from $1,500 to $6,000 per day or $500,000 to $2,000,000 per year. The Achronix and Myrtle.ai solution can support 4000 real-time streams (RTS) on one accelerator card, delivering the capacity to handle over one million calls per day.
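The arithmetic behind these figures is easy to reproduce. The sketch below plugs in the per-minute rates and call volume exactly as stated; it assumes every call minute is billed and, for the capacity estimate, that each stream handles back-to-back calls around the clock. Taken at face value, the stated rates give roughly $2,500 to $6,250 per day, in the same range as but not identical to the rounded figures quoted above, which presumably rest on inputs not given here.

```python
def daily_cloud_cost(calls_per_day, minutes_per_call, rate_per_minute):
    """Cloud ASR cost per day if every call minute is billed at the given rate."""
    return calls_per_day * minutes_per_call * rate_per_minute

def calls_per_day_capacity(streams, minutes_per_call):
    """Calls per day if every stream processes calls back to back."""
    minutes_per_day = 24 * 60
    return streams * minutes_per_day // minutes_per_call

low = daily_cloud_cost(50_000, 5, 0.01)
high = daily_cloud_cost(50_000, 5, 0.025)
capacity = calls_per_day_capacity(4000, 5)

print(f"cloud cost: ${low:,.0f} to ${high:,.0f} per day")
print(f"card capacity: {capacity:,} five-minute calls per day")
```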

There are many factors that would dictate the cost of a stand-alone ASR appliance. For this particular example, assume the Achronix ASR acceleration solution is delivered on an FPGA-based PCIe card integrated into an x86-based 2U server. Sold by a system integrator, this appliance might cost $50,000, and the annual cost of running the server could double that. This leads to $100,000 for the first year for an on-premises ASR appliance. Comparing this on-premises solution with cloud API services, the end user can enjoy a savings of 5X to 20X in the first year.
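The 5X to 20X range follows directly from the numbers in this example. Here is a minimal sketch of the comparison, using the article’s $50,000 appliance price, an equal first-year running cost, and the $500,000 to $2,000,000 annual cloud range; the function name is illustrative only.

```python
def first_year_savings(cloud_annual, appliance_price, annual_running_cost):
    """Return (savings ratio, fractional cost reduction) for the first year."""
    on_prem = appliance_price + annual_running_cost
    ratio = cloud_annual / on_prem
    reduction = 1.0 - on_prem / cloud_annual
    return ratio, reduction

for cloud in (500_000, 2_000_000):
    ratio, reduction = first_year_savings(cloud, 50_000, 50_000)
    print(f"cloud ${cloud:,}/yr: {ratio:.0f}X cheaper, {reduction:.0%} lower cost")
```

Expressed as a percentage, a 5X ratio is an 80% reduction and 20X is 95%; a 90% reduction corresponds to 10X, which sits inside that range.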

Achronix and Myrtle.ai are teaming up to deliver an ASR platform consisting of a 200W, x16 PCIe Gen4-based accelerator card and the associated software which together can sustain up to 4000 RTS concurrently, processing up to 1 million five-minute transcriptions per 24-hour period. Comparing this PCIe accelerator card on a single x86 server to the cost of cloud ASR services, the first year CAPEX and OPEX can be reduced by as much as 90%.

To download the full whitepaper, visit achronix.com.

Also read:

Benefits of a 2D Network On Chip for FPGAs

5G Requires Rethinking Deployment Strategies

Integrated 2D NoC vs a Soft Implemented 2D NoC