Smartphone Processor Trends and Process Differences down through 7nm

Smartphone Processor Trends and Process Differences down through 7nm
by Fred Chen on 08-30-2020 at 6:00 am

Transistor density vs. process for Huawei and Apple

This comparison of smartphone processors from different companies and fab processes was originally planned as a post, but as the information content grew, it became an article. Here, due to information availability, Apple, Huawei, and Samsung Exynos processors will get the most coverage, but a few Qualcomm Snapdragon processors will also be included in some comparisons.

The Processes
The processors compared here will be fabbed at Samsung and TSMC, starting from 14/16nm and going down to 7nm EUV versions.

What’s being compared
Die width and die height will be compared among the processors from each of the different companies. Transistor density data (available only for certain processors) will be used for process comparisons.

Smartphone processor die sizes
In Figure 1, the die size trends for the smartphone processors from Samsung, Huawei, and Apple are separately plotted vs. the different processes used.

Figure 1. Die size trends vs. process for Samsung (left), Huawei (center), and Apple (right). Qualcomm is added at far left for die area only.

For Samsung, the introduction of 7LPP enabled a die height reduction. Unexpectedly, however, its 91.83 mm2 area is not the smallest die area among the processors considered here. Among 7nm processors, the smallest die belongs to the Snapdragon 855 (73.3 mm2), fabricated on TSMC’s original 7nm process. The Snapdragon 835 was even smaller at 72.3 mm2, but it is made on Samsung’s 10nm (LPE) process, with a much lower transistor density. The other 7nm EUV processor, the Huawei Kirin 990 5G made at TSMC, also has an enlarged die (113.3 mm2), but this can be attributed to new features in the processor design [1].

Die width is not trending down with advanced processes. This will be a concern for the use of EUV, as discussed in detail later. With shrinking cell track heights, the impact of illumination rotation will become more significant.

Transistor Density
Transistor density is plotted for Huawei and Apple processors vs. process in Figure 2.

Figure 2. Transistor density vs. process for Huawei (left) and Apple (right).

The biggest surprise here is that TSMC’s 7nm EUV process does NOT give the highest transistor density. Among the Kirin processors shown, the Kirin 980 has the highest density (93.1 MTr/mm2), higher than the Kirin 990 5G at 90.9 MTr/mm2. The only other processor to exceed the Kirin 990 5G’s density is the Snapdragon 855, at 91.4 MTr/mm2.

The highest densities and smallest die sizes at 7nm so far were realized on TSMC’s first 7nm process. The TSMC 7nm process in fact has a shorter high-density track height (240 nm) [2] than Samsung’s 7nm EUV process (243 nm) [3], and the Exynos 990 used the high-performance track height of 270 nm. These taller cells offset the potential benefit of a smaller metal pitch.

Going to 5nm, track height is expected to be reduced, especially with 6-track cells becoming available.

Track height reduction consequences for EUV
Samsung’s 7nm EUV process offers 270 nm (7.5-track) and 243 nm (6.75-track) cell heights. The 5nm continuation of this process also offers a 216 nm (6-track) cell height [4]. The process is considered a continuation because the minimum metal pitch remains at 36 nm. The minimum metal pitch has a strong influence on the EUV process, as it sets a preferred illumination angle (whose sine is exactly 0.1875). However, this illumination angle is rotated across the die, up to 18.2 degrees at 13 mm from the center [5]. Since the die widths of the Samsung Exynos processors shown in Figure 1 have been in the neighborhood of 10.7 mm, we should consider the effect of a 7.5 degree (= 18.2 degrees x 5.35 mm/13 mm) maximum rotation at the chip edge compared to the center. The effect is not so profound for the 36 nm pitch itself, but it is more significant for the track height, which acts as the true pitch. The much larger track height as a pitch generates a more complex diffraction order spectrum. The phase difference between the 0th and 1st orders is normally not affected significantly by the incident angle “shadow” in the x-direction, but the rotation changes this (Figure 3).

Figure 3. The impact of a 7.5 degree rotation of illumination for 243 nm (top) and 216 nm (bottom) track heights. For the rotated case, defocus generates a larger range of phase errors across the pupil (different angle tilts in the x-direction). Thus, images at the die edge go out of focus more easily.

The lines in the 6- or 6.75-track cell will go out of focus more easily at the die edge. The effect is more severe not only as the minimum metal pitch decreases but also as track height decreases, due to larger path differences between consecutive orders at smaller pitches.
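
For the numbers quoted above, the edge-of-die rotation and the pitch-defined illumination angle can be estimated with a few lines of arithmetic. The Python sketch below uses only values taken from the text (18.2 degrees at 13 mm, a ~10.7 mm Exynos die width, sine of 0.1875 for the 36 nm pitch); it is an illustration of the geometry, not a lithography simulation.

```python
import math

# Illumination rotation at the die edge, using the figures quoted in the text:
# 18.2 degrees of rotation at 13 mm from the center [5], and an Exynos die
# width of roughly 10.7 mm (Figure 1).
MAX_ROTATION_DEG = 18.2
MAX_RADIUS_MM = 13.0

def rotation_at_edge(die_width_mm: float) -> float:
    """Linear estimate of the illumination rotation (degrees) at the die edge."""
    return MAX_ROTATION_DEG * (die_width_mm / 2.0) / MAX_RADIUS_MM

# Preferred illumination angle set by the 36 nm minimum metal pitch:
# sin(angle) = 0.1875, i.e. 13.5 nm EUV wavelength / (2 x 36 nm pitch).
preferred_angle_deg = math.degrees(math.asin(0.1875))

print(f"Rotation at the edge of a 10.7 mm wide die: {rotation_at_edge(10.7):.1f} deg")
print(f"Preferred illumination angle for a 36 nm pitch: {preferred_angle_deg:.1f} deg")
```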

What to expect in the future
Now that Huawei’s supply from TSMC has been interrupted, there is a possibility it will rely on a new foundry source within China, such as SMIC [6]. It may try to first replicate the success of the Kirin 980 domestically, as mainland China has not yet reached the ‘7nm’ stage in its technology development. In the meantime, both Apple and Qualcomm continue to be successful in their work with TSMC on the 7nm ‘P’ process. With some reduction in popularity of the Exynos processor series, Samsung’s Exynos processor designs may be swapped for a non-customized ARM core design [7]; it remains to be seen if that can revitalize in-house processor design. Otherwise, Samsung’s phones can still be sold with Qualcomm’s Snapdragon processors exclusively.

References
Processor die size and transistor density information can be found from Techinsights (Exynos 8895, Exynos 9810, Exynos 990, A13, Kirin 990 5G, Snapdragon 835, Snapdragon 865), Anandtech (A9, A10X, Kirin 960, Kirin 980), Chiprebel (Exynos 9820, A11), Wikichip (A12, Kirin 970, Kirin 990 4G, Snapdragon 855).

[1] https://www.anandtech.com/show/14851/huawei-announces-kirin-990-and-kirin-990-5g-dual-soc-approach-integrated-5g-modem

[2] https://fuse.wikichip.org/news/2408/tsmc-7nm-hd-and-hp-cells-2nd-gen-7nm-and-the-snapdragon-855-dtco/

[3] https://fuse.wikichip.org/news/1479/vlsi-2018-samsungs-2nd-gen-7nm-euv-goes-hvm/

[4] https://fuse.wikichip.org/news/2823/samsung-5-nm-and-4-nm-update/

[5] A. V. Pret et al., Proc. SPIE 10809, 108090A (2018).

[6] https://www.eetasia.com/how-smic-can-keep-up-with-advanced-process-technologies-part-2/

[7] https://www.notebookcheck.net/Why-ARM-s-Cortex-X1-cores-likely-for-Samsung-s-Exynos-1000-possible-future-Pixel-SoC-too.466957.0.html



Thermo-compression bonding for Large Stacked HBM Die

Thermo-compression bonding for Large Stacked HBM Die
by Tom Dillinger on 07-24-2020 at 8:00 am

HBM stack

Summary

Thermo-compression bonding is used in heterogeneous 3D packaging technology – this attach method was applied to the assembly of large (12-stack and 16-stack) high bandwidth memory (HBM) die, with significant bandwidth and power improvements over traditional microbump attach.

Introduction

The rapid growth of heterogeneous die packaging technology has led to two innovative product developments.

For high-performance applications, system architects have incorporated a stack of memory die in a 2.5D package configuration with a processor chip – see the figures below for a typical implementation, and expanded cross-section.  These high-bandwidth memory (HBM) architectures typically employ four (HBM, 1st gen) or eight (HBM2/2E) DRAM die attached to a “base” memory controller die.  The stack utilizes microbumps between die, with through-silicon vias (TSV’s) for the vertical connections.

A silicon interposer with multiple redistribution metal layers (RDL) and integrated trench decoupling capacitors supports this 2.5D topology, providing both signal connectivity and the power distribution network to the die.

A more recent package innovation provides the capability to attach two heterogeneous die in a 3D configuration, in either face-to-face or face-to-back orientations (with TSV’s).  This capability was enabled by the transition of (dense) thermo-compression bonding for die attach from R&D to production status.

Previous semiwiki articles have reviewed these packaging options in detail.  [1, 2]  Note that the potential for both these technologies to be used together – i.e., 3D heterogeneous die integration (“front-end”) with 2.5D system integration (“back-end”, typically with HBM) – will offer architects a myriad of tradeoffs, in terms of:  power, performance, yield, cost, area, volume, pin count/density, thermal behavior, and reliability.  A new EDA tools/flows discipline is emerging, to assist product developers with these tradeoffs – pathfinding.  (Look for more semiwiki articles in this area in the future.)

Thermo-compression Bonding for HBM’s

High-performance applications that integrate a (general-purpose or application-specific) processor with HBM are growing rapidly, and they need increasing amounts of (local) memory capacity and bandwidth.  To date, a main R&D focus has been to expand the 2.5D substrate area, to accommodate more HBM stacks.  For example, TSMC recently announced an increase in the maximum substrate area for their 2.5D Chip-on-Wafer-on-Substrate (CoWoS) offering, enabling the extent of the interposer to exceed 1X the maximum lithographic reticle size.  RDL connections are contiguous across multiple interposer-as-wafer exposures.

Rather than continuing to push these lateral dimensions for more HBM stacks, there is a concurrent effort to increase the number of individual memory die in each stack.  Yet, the microbump standoffs with the TSV attach technology introduce additional RLC signal losses up the stack, with a less-than-optimum thermal profile, as well.

At the recent VLSI 2020 Symposium, TSMC presented their data for the application of thermo-compression bonding used in current 3D topologies directly to the assembly of the HBM stack – see the figure below. [3]

A compatibility requirement was to maintain a low-temperature bonding process similar to the microbump attach method.  Replacing the microbumps between die with thermo-compression bonds will result in reduced RLC losses, greater signal bandwidth, and less dissipated energy per bit.  The simulation analysis results from TSMC are shown below, using electrical models for the microbumps, compression bonds, and TSV’s.  Note that TSMC pushed the HBM configuration to 12-die and 16-die memory stacks, well beyond current production (microbump-based) designs.

To demonstrate the manufacturability of a very tall stack with bonding, TSMC presented linear resistance data in (bond + TSV) chains up and down the die – see the figure below.

A unique characteristic of the bonded HBM stack compared to the microbump stack was the reduction in thermal resistance.  The directly-attached dies provide a more efficient thermal path than dies separated by microbumps.  The TSMC data is shown below, illustrating the improvement in the temperature delta between the HBM stack and the top (ambient) environment.

The conclusion of the TSMC presentation offered future roadmap opportunities:

  • Tighter thermo-compression bond pitch (< 10um) is achievable, offering higher die-connections/mm**2.   (Bandwidth = I/O_count * data rate)
  • Additional R&D investment is being made to pursue further thinning of the DRAM die, reducing the RLC insertion losses and improving the thermal resistance (and allowing more die in the same package volume).  For example, the current ~60um die thickness after back-side grinding and polishing could be pushed to perhaps ~50um.

The figure on the left below highlights the future targets for bond connection density, while the figure on the right shows the additional bandwidth and energy/bit improvements achievable with a more aggressive HBM memory die thickness.
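
To get a feel for how bond pitch translates into connection density and aggregate bandwidth (Bandwidth = I/O_count * data rate, as noted above), the short sketch below runs the arithmetic for a square bond grid. The pitch values and per-pin data rate are illustrative assumptions, not figures from the TSMC paper.

```python
# Rough sketch of the scaling implied by "Bandwidth = I/O_count * data rate"
# and a tighter (<10 um) bond pitch. The pitch values and per-pin data rate
# below are illustrative assumptions, not figures from the TSMC paper.
def connections_per_mm2(bond_pitch_um: float) -> float:
    """Bond connections per mm^2 for a square grid at the given pitch."""
    per_mm = 1000.0 / bond_pitch_um
    return per_mm * per_mm

def bandwidth_tbps(io_count: int, gbps_per_pin: float) -> float:
    return io_count * gbps_per_pin / 1000.0

for pitch in (36.0, 10.0):                    # um; assumed microbump vs. bond pitch
    print(f"{pitch:5.1f} um pitch -> {connections_per_mm2(pitch):8.0f} connections/mm^2")

# Example: 1024 I/O at an assumed 2 Gbps per pin
print(f"1024 I/O @ 2 Gbps/pin -> {bandwidth_tbps(1024, 2.0):.2f} Tbps")
```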

The application of 3D packaging technology thermo-compression bonding to HBM construction will enable greater memory capacity and bandwidth, required by high-performance computing applications.  System architects now have yet another variable to optimize in their pathfinding efforts.

For more information on the 2.5D and 3D heterogeneous packaging technology offerings from TSMC, please follow this link.

-chipguy

References

[1]  https://semiwiki.com/semiconductor-manufacturers/tsmc/285129-tsmcs-advanced-ic-packaging-solutions/

[2]  https://semiwiki.com/semiconductor-manufacturers/tsmc/8150-tsmc-technology-symposium-review-part-ii/

[3]  Tsai, C.H., et al., “Low Temperature SoIC Bonding and Stacking Technology for 12/16-Hi High Bandwidth Memory (HBM)”, VLSI 2020 Symposium, Paper TH1.1.

Images supplied by the VLSI Symposium on Technology & Circuits 2020.

 


Intel 7NM Slip Causes Reassessment of Fab Model

Intel 7NM Slip Causes Reassessment of Fab Model
by Robert Maire on 07-23-2020 at 5:00 pm

Intel vs TSMC

Waving white surrender flag as TSMC dominates-
The quarter was a success but the patient is dying-
Packaging now critical as Moore progress stumbles-
Intel reported a great quarter but weak H2 guidance-
But 7NM slip and “fab lite” talk sends shockwaves-

Intel reported a great quarter beating numbers all around with revenues of $19.7B and EPS of $1.23. Revenue was $1.2B better than expected and EPS was $0.13 better than expected. Guidance was for $18.2B and EPS of $1.10 as a widespread slowdown is expected to hit in H2.

Results for the quarter were great, but other, more significant issues far outweigh and swamp the quarterly results. We won’t waste time regurgitating the quarterly results, which are well summarized in Intel’s slides:

Intel Earnings presentation

7NM products delayed at least 6 months while process is a year behind
Echoes of the 10NM delay disaster. Perhaps the biggest news was that 7NM will be delayed at least another 6 months due to yield issues. This seems to put the overall 7NM delay at roughly a year. It was unclear whether this is “one and done” or the beginning of another series of rolling delays like those that haunted 10NM. Either way, the news is not good at all.

Rather than Intel regaining its “Mojo” at 7NM as some had hoped, suffering another delay and falling further behind TSMC is just horrible; there is no way around it. It’s a huge disappointment and heads will likely roll.

While management did suggest that the problem is understood and identified we came away without a firm feeling that it was under control, fixed or on its way to being fixed. Further slippage due to not finding a solution could easily happen as we saw at 10NM.

The 7NM slip is pushing Intel into a “fab lite” model following AMD’s lead-
Would make both Intel and AMD dependent upon TSMC…and more even-
During the earnings call, management made it quite clear that they are looking at alternatives for manufacturing of future nodes: whether to outsource, and how much to outsource to TSMC.

It seems from the tone of tonight’s call coupled with the 7NM slip that Intel is on the slippery slope to give more of its manufacturing to TSMC and perhaps TSMC will get to do Intel’s most leading edge manufacturing as Intel falls further behind.

Management couched it as a prudent allocation of resources and dollars but it sure sounds a lot more like waving the white flag of surrender after you’ve lost the race.

It sounds like sacrilege but Intel may be on the road to a “fab lite” model. Most semiconductor investors may not be old enough, but I can still hear the echoes of AMD’s founder, Jerry Sanders and his “real men have fabs” speech.

We can only hope that Intel can get its act together and get 7NM back on track and perhaps even make up some lost time, but we wouldn’t bet our investment dollars on it.

Intel joining Apple and AMD at TSMC’s fab on China’s doorstep…..
Apple obviously saw this coming, and investors should have seen it coming with Apple’s recent announcement to give up on Intel. Apple correctly figured out that it could go straight to the source, TSMC, with its own customized design and do much better on its own.

Obviously there will be little if any transistor density advantages between AMD and Intel if their advanced chips are built at the same TSMC fab. Differences will come down to design capability, which Intel continues to tout, but we don’t think there is that much there there.

The other ominous omen of Intel’s issues was likely the recent departure (for “personal reasons”) last month of Jim Keller, the famous CPU “Guru” who has had stints at AMD, Apple and Tesla designing their best CPU’s and who had joined Intel in 2018, with many hoping he could revamp things.

TSMC is obviously laughing all the way to the bank as Intel’s business will be huge upside, many times the size of Huawei business lost.

It would mean that in a couple of years, TSMC will be manufacturing every advanced chip on the planet. The demise of US chip making accelerates. We find the news of Intel going “fab lite” to be a huge contradiction to the recent talk of the US government’s “Chips for America” package of $22.8B in aid for the industry.

Intel’s talk of outsourcing to TSMC is in direct contradiction to Bob Swan’s personal lobbying of the White House and government officials, and personal trips to DC to convince officials to have Intel lead a “trusted fab” initiative, while at the exact same time planning to outsource more manufacturing to TSMC.

It seems disingenuous to be lobbying to lead a US semiconductor resurgence initiative while at the same time calculating how much of the company’s product to outsource to Taiwan.

The government should be highly embarrassed as Intel is the last advanced US semiconductor logic manufacturer after GloFo gave up the race. Micron is not the leader in memory. If the US government had any smarts they would match China’s $100B checkbook as well as push other efforts to keep manufacturing in the US.

TSMC’s “planned” fab in Arizona isn’t even throwing a bone to a dog as the capacity is far too little and far too far behind the leading edge to be of any consequence at all.

Packaging matters
One very interesting point that came out of the call is Intel’s increasing reliance on advanced 3D packaging to mix and match heterogeneous die in a mixed package to optimize manufacturing and performance. Intel will be able to mix a 14NM die with a 22NM die, throw in a few memory dies in a heterogeneous package, and extend Moore’s Law without the geometry shrinks that are obviously harder for them to do and increasingly delayed.

TSMC is already great at packaging, and AMD has also pushed chiplet technology, so unfortunately it’s not an advantage but just a “me too” technology for Intel.

The Stocks
Obviously Intel stock will get whacked, as it did to the tune of 10% in the aftermarket, and perhaps even more so as the repercussions of the delay and outsourcing sink in.

The weak guidance doesn’t help, but a weaker H2 is something we have been talking about for quite a while and the market should be expecting it. Perhaps there are still investors who think the good times will continue into H2. Intel should be a wake-up call.

Intel guided capex to $15B, which is no surprise, so the equipment stocks shouldn’t see much reaction from that, but they will likely see a negative reaction to the longer-term negatives out of Intel and the increasing buying power of TSMC.

TSMC is looking a lot more like the old Intel with its dominance of capex spend in the industry. It is certainly not a positive for the US semiconductor industry to be so reliant on a tiny island “runaway province”, soon to be re-united with mother China by any means necessary. All in all, not positive for the chip industry, with perhaps the exception of TSMC and AMD.

Semiconductor Advisors

Semiconductor Advisors on SemiWiki


In-Memory Computing for Low-Power Neural Network Inference

In-Memory Computing for Low-Power Neural Network Inference
by Tom Dillinger on 07-17-2020 at 10:00 am

von Neumann bottleneck

“AI is the new electricity,” according to Andrew Ng, Professor at Stanford University.  The potential applications for machine learning classification are vast.  Yet, current ML inference techniques are limited by the high power dissipation associated with traditional architectures.  The figure below highlights the von Neumann bottleneck.  (A von Neumann architecture refers to the separation between program execution and data storage.)

The power dissipation associated with moving neural network data – e.g., inputs, weights, and intermediate results for each layer – often far exceeds the power dissipation to perform the actual network node calculation, by 100X or more, as illustrated below.

A general diagram of a (fully-connected, “deep”) neural network is depicted below.  The fundamental operation at each node of each layer is the “multiply-accumulate” (MAC) of the node inputs, node weights, and bias.  The layer output is given by:   [y] = [W] * [x] + [b], where [x] is a one-dimensional vector of inputs from the previous layer, [W] is the 2D set of weights for the layer, and [b] is a one-dimensional vector of bias values.  The results are typically filtered through an activation function, which “normalizes” the input vector for the next layer.

For a single node, the equation above reduces to:

yi = SUM(W[i, 1:n] * x[1:n]) + bi
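
As a concrete illustration of the layer and single-node equations above, the following NumPy sketch evaluates both forms and checks that they agree; the dimensions, random weights, and ReLU activation are arbitrary assumptions for the example.

```python
import numpy as np

# Minimal sketch of the layer computation [y] = [W] * [x] + [b] and the
# per-node form y_i = sum(W[i, :] * x) + b_i, using small assumed dimensions.
rng = np.random.default_rng(0)
n_in, n_out = 8, 4
W = rng.standard_normal((n_out, n_in)).astype(np.float32)   # layer weights
x = rng.standard_normal(n_in).astype(np.float32)            # inputs from previous layer
b = rng.standard_normal(n_out).astype(np.float32)           # bias vector

y_layer = W @ x + b                       # full-layer MAC
y_node0 = np.sum(W[0, :] * x) + b[0]      # single-node form of the same equation
assert np.isclose(y_layer[0], y_node0)

# A ReLU activation "normalizes" the vector passed to the next layer.
y_activated = np.maximum(y_layer, 0.0)
```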

For CPU, GPU, or neural network accelerator hardware, each datum is represented by a specific numeric type – typically, 32-bit floating point (FP32).  The FP32 MAC computation in the processor/accelerator is power-optimized.  The data transfer operations to/from memory are the key dissipation issue.

An active area of neural network research is to investigate architectures that reduce the distance between computation and memory.  One option utilizes a 2.5D packaging technology, with high-bandwidth memory (HBM) stacks integrated with the processing unit.  Another nascent area is to investigate in-memory computing (IMC), where some degree of computation is able to be completed directly in the memory array.

Additionally, data scientists are researching how to best reduce the data values to a representation more suitable to very low-power constraints – e.g., INT8 or INT4, rather than FP32.  The best-known neural network example is the MNIST application for (0 through 9) digit recognition of hand-written numerals (often called the “Hello, World” of neural network classification).  The figure below illustrates very high accuracy achievable on this application with relatively low-precision integer weights and values, as applied to the 28×28 grayscale pixel images of handwritten digits.

One option for data type reduction would be to train the network with INT4 values from the start.  Yet, the typical (gradient descent) back-propagation algorithm that adjusts weights to reduce classification errors during training is hampered by the coarse resolution of the INT4 value.  A promising research avenue would be to conduct training with an extended data type, then quantize the network weights (e.g., to INT4) for inference usage.  The new inference data type values from quantization could be signed or unsigned (with an implicit offset).
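
A minimal sketch of that train-then-quantize flow is shown below, using simple symmetric per-tensor quantization of FP32 weights to signed INT4. This is one straightforward scheme chosen for illustration, not the specific method used in the referenced work.

```python
import numpy as np

# Sketch of "train in a wider type, then quantize for inference":
# symmetric linear quantization of FP32 weights to signed INT4 ([-8, 7]).
def quantize_int4(w_fp32: np.ndarray):
    scale = np.max(np.abs(w_fp32)) / 7.0          # map the largest weight to +/-7
    q = np.clip(np.round(w_fp32 / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal(16).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
print("max quantization error:", np.max(np.abs(w - w_hat)))
```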

IMC and Advanced Memory Technology

At the recent VLSI 2020 Symposium, Yih Wang, Director in the Design and Technology Platform Group at TSMC, gave an overview of areas where in-memory computing is being explored to support deep neural network inferencing.[1]   Specifically, he highlighted an example of IMC-based SRAM fabrication in 7nm that TSMC recently announced.[2]  This article summarizes the highlights of his presentation.

SRAM-based IMC

The figure below illustrates how a binary multiply operation could be implemented in an SRAM.  The “product” of an input value and a weight bit value is realized by accessing a wordline transistor (input) and a bit-cell read transistor (weight).  Only in the case where both values are ‘1’ will the series device connection conduct current from the (pre-charged) bitline, for the duration of the wordline input pulse.

In other words, the ‘1’ times ‘1’ product results in a voltage change on the bitline, dependent upon the Ids current, the bitline capacitance, and the duration of the wordline ‘1’ pulse.

The equation for the output value yi above requires a summation across the full dimension of the input vector and a row of the weight matrix.   Whereas a conventional SRAM memory read cycle activates only a single decoded address wordline, consider what happens when every wordline corresponding to an input vector bit value of ‘1’ is raised.  The figure above also presents an equation for the total bitline voltage swing as dependent on the current from all (‘1’ * ‘1’) input and weight products.

Another view of the implementation of the dot product with an SRAM array is shown below.  Note that there are two sets of wordline drivers – one set for the neural network layer input vector, and one set for normal SRAM operation (e.g., to write the weights into the array).

Also, the traditional CMOS six-transistor (6T) bit cell is designed for a single active wordline (with restoring sense amplification for data and data_bar).  For the dot-product calculation, where many input wordlines could be active, an 8T cell with a Read bitline separate from the Write bitlines is required – the voltage swing equation above applies to the current discharging this distinct Read bitline.

The figures above are simplified, as they illustrate the vector product using ‘1’ or ‘0’ values.  As mentioned earlier, the quantized data types for low power inference are likely greater than one bit, such as INT4.  The implementation used by TSMC is unique.  The 4-bit value of the input vector entry is represented as a series of 0 to 15 wordline pulses, as illustrated below.  The cumulative discharge current on the Read bitline represents the contribution from all input pulses on each wordline row.

The multiplication product output is also an INT4 value.  The four output signals use separate bitlines   – RBL[3] through RBL[0] – as shown below.  When the product is being calculated, the pre-charged bitlines are discharged as described above.  The total capacitance on each bitline is the same – e.g., “9 units” – the parallel combination of the calculation and compensation capacitances.

After the bitline discharge is complete, the compensation capacitances are disconnected.  Note the positional weights of the computation capacitances – i.e., RBL[3] has 8 times the capacitance of RBL[0].  The figure below shows the second phase of evaluation, when the four Read bitlines are connected together.  The “charge sharing” across the four line capacitances implies that the contribution of the RBL[3] line is 8 times greater than RBL[0], representing its binary power in a 4-bit multiplicand.

In short, a vector of 4-bit input values – each represented as 0-15 pulses on a single wordline—is multiplied against a vector of 4-bit weights, and the total discharge current is used to produce a single (capacitive charge-shared) voltage at the input to an Analog-to-Digital converter.  The ADC output is the (normalized) 4-bit vector product, which is input to the bias accumulator and activation function for the neural network node.
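
The end-to-end behavior just described can be captured in a small idealized model: pulse counts stand in for the INT4 inputs, each active wordline-pulse/weight-bit pair removes one unit of bitline charge, and binary-weighted charge sharing recombines the four bitlines. The sketch below is a behavioral abstraction under those ideal assumptions, not a model of the actual TSMC macro.

```python
import numpy as np

# Behavioral sketch of the in-memory MAC described above (idealized: every
# active wordline-pulse / weight-bit pair removes one unit of charge, i.e. no
# linear-region current loss). Dimensions and values are illustrative.
rng = np.random.default_rng(2)
n = 16
inputs  = rng.integers(0, 16, size=n)          # INT4 inputs -> 0..15 wordline pulses
weights = rng.integers(0, 16, size=n)          # unsigned INT4 weights stored in the array

# Per-bitline discharge: RBL[k] collects pulses gated by weight bit k of every row.
rbl_charge = np.zeros(4)
for k in range(4):
    weight_bit_k = (weights >> k) & 1
    rbl_charge[k] = np.sum(inputs * weight_bit_k)

# Charge sharing with binary-weighted capacitances (8:4:2:1) recombines the bits.
analog_result = np.sum(rbl_charge * np.array([1, 2, 4, 8]))

assert analog_result == np.dot(inputs, weights)   # ideal MAC recovered
print("dot product:", analog_result)
```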

Yih highlighted the assumption that the bitline current contribution from each active (‘1’ * ‘1’) product is the same – i.e., all active series devices will contribute the same (saturated) current during the wordline pulse duration.  In actuality, if the bitline voltage drops significantly during evaluation, the Ids currents will be less, operating in the linear region.  As a result, the quantization of a trained deep NN model will need to take this non-linearity into account when assigning weight values.  The figure below indicates that a significant improvement in classification accuracy is achieved when this corrective step is taken during quantization.

IMC with Non-volatile Memory (NVM)

In addition to using CMOS SRAM bit cells, Yih highlighted that an additional area of research is to use a Resistive-RAM (ReRAM) bit cell array to store weights, as illustrated below.  The combination of an input wordline transistor pulse with a high-R or low-R resistive cell defines the resulting bitline current.   (Ideally, the ratio of the high-resistance state to the low-resistance state is very large.)  Although similar to the SRAM operation described above, the ReRAM array would offer much higher bit density.  Also, further fabrication research into the potential for one ReRAM bit cell to have more than two non-volatile resistive states offers even greater neural network density.

Summary

Yih’s presentation provided insights into how the architectural design of memory arrays could readily support In-Memory Computing, such as the internal product of inputs and weights fundamental to each node of a deep neural network.  The IMC approach provides a dense and extremely low-power alternative to processor plus memory implementations, with the tradeoff of quantized data representation.   It will be fascinating to see how IMC array designs evolve to support the “AI is the new electricity” demand.

-chipguy

 

References

[1]  Yih Wang, “Design Considerations for Emerging Memory and In-Memory Computing”, VLSI 2020 Symposium, Short Course 3.8.

[2]  Dong, Q., et al., “A 351 TOPS/W and 372.4 GOPS Compute-in-Memory SRAM Macro in 7nm FinFET CMOS for Machine-Learning Applications”, ISSCC 2020, Paper 15.3.

Also, please refer to:

[3] Choukroun, Y.., et al., “Low-bit Quantization of Neural Networks for Efficient Inference”, IEEE International Conference on Computer Vision, 2019, https://ieeexplore.ieee.org/document/9022167 .

[4] Agrawal, A., et al., “X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories”, IEEE Transactions on Circuits and Systems, Volume 65, Issue 12, December, 2018, https://ieeexplore.ieee.org/document/8401845 .

 

Images supplied by the VLSI Symposium on Technology & Circuits 2020.

 

 


A Compelling Application for AI in Semiconductor Manufacturing

A Compelling Application for AI in Semiconductor Manufacturing
by Tom Dillinger on 07-06-2020 at 6:00 am

AI opportunities

There have been a multitude of announcements recently relative to the incorporation of machine learning (ML) methods into EDA tool algorithms, mostly in the physical implementation flows.  For example, deterministic ML-based decision algorithms applied to cell placement and signal interconnect routing promise to expedite and optimize physical design results, without the iterative cell-swap placement and rip-up-and-reroute algorithms.  These quality-of-results and runtime improvements are noteworthy, to be sure.

Yet, there is one facet of the semiconductor industry that is (or soon will be) critically-dependent upon AI support – the metrology of semiconductor process characterization, both during initial process development/bring-up, and in-line inspection driving continuous process improvement.  (Webster’s defines metrology as “the application of measuring instruments and testing procedures to provide accurate and reliable measurements”.)  Every aspect of semiconductor processing, from lithographic design rule specifications to ongoing yield analysis, is fundamentally dependent upon accurate and reliable data for critical dimension (CD) lithographic patterning and material composition.

At the recent VLSI 2020 Symposium, Yi-hung Lin, Manager of the Advanced Metrology Engineering Group at TSMC, gave a compelling presentation on the current status of semiconductor metrology techniques, and the opportunities for AI methods to provide the necessary breakthroughs to support future process node development.  This article briefly summarizes the highlights of his talk. [1]

The figure below introduced Yi-hung’s talk, illustrating the sequence where metrology techniques are used.  There is an initial analysis of fabrication materials specifications and lithography targets during development.  Once the process transitions to manufacturing, in-line (non-destructive) inspection is implemented to ensure that variations are within the process window for high yield.  Over time, the breadth of different designs, and specifically, the introduction of the process on multiple fab lines requires focus on dimensional matching, wafer-to-wafer, lot-to-lot, and fab line-to-fab line.

The “pre-learning” opportunities suggest that initial process bring-up metrology data could be used as the training set for AI model development, subsequently applied in production.  Ideally, the models would be used to accelerate the time to reach high-volume manufacturing.  These AI opportunities are described in more detail below.

Optical Critical Dimension (OCD) Spectroscopy
I know some members of the SemiWiki audience fondly (or, perhaps not so fondly) recall the many hours spent in the clean room looking through a Zeiss microscope at wafers, to evaluate developed photoresist layers, layer-to-layer alignment verniers, and material etch results.  At the wavelength of the microscope light source, these multiple-micrometer features were visually distinguishable – those days are long, long gone.

Yi-hung highlighted that OCD spectroscopy is still a key source of process metrology data.  It is fast, inexpensive, and non-destructive – yet, the utilization of OCD has changed in deep sub-micron nodes.  The figure below illustrates the application of optical light sources in surface metrology.

The incident (visible, or increasingly, X-ray) wavelength is provided to a 3D simulation model of the surface, which solves electromagnetic equations to predict the scattering.  These predicted results are compared to the measured spectrum, and the model is adjusted – a metrology “solution” is achieved when the measured and EM simulation results converge.

OCD illumination is most applicable when an appropriate (1D or 2D) “optical grating-like” pattern is used for reflective diffraction of the incident light.  However, the challenge is that current surface topographies are definitely three-dimensional, and the material measures of interest do not resemble a planar grating.  Optical X-ray scatterometry provides improved analysis accuracy with these 3D topographies, but is an extremely slow method of data gathering.

Yi-hung used the term ML-OCD to describe how an AI model derived from other metrology techniques could provide an effective alternative to the converged EM simulation approach.  As illustrated below, the ML-OCD spectral data would serve as the input training dataset for model development, with the output target being the measurements from (destructive) transmission electron microscopy (TEM), to be discussed next.
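
A toy version of the ML-OCD idea is sketched below: a regression model is trained to map measured spectra (inputs) to TEM reference CD values (targets). The random-forest model and the synthetic spectra are purely illustrative stand-ins; the real training set would be the paired OCD/TEM data collected during process bring-up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustration of the ML-OCD idea: learn a mapping from measured OCD spectra
# (model inputs) to reference CD values from destructive TEM (targets).
# The data here are synthetic stand-ins, not real metrology data.
rng = np.random.default_rng(3)
n_samples, n_wavelengths = 500, 128
spectra = rng.standard_normal((n_samples, n_wavelengths))      # reflectance vs. wavelength
true_cd = 12.0 + spectra[:, :8].sum(axis=1) * 0.1              # synthetic CD (nm) hidden in spectra
tem_cd = true_cd + rng.normal(0.0, 0.05, n_samples)            # TEM reference with measurement noise

X_train, X_test, y_train, y_test = train_test_split(spectra, tem_cd, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out spectra:", round(model.score(X_test, y_test), 3))
```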

ML for Transmission Electron Microscopy (TEM)
TEM utilizes a focused electron beam that is directed through a very thin sample – e.g., 100nm or thinner.  The resulting (black-and-white) image provides high-magnification detail of the material cross-section, due to the much smaller electron wavelength (1000X smaller than an optical photon).

There are two areas that Yi-hung highlighted where ML techniques would be ideal for TEM images.  The first would utilize familiar image processing and classification techniques to automatically extract CD features, which is especially useful for “blurred” TEM images.  The second would be to serve as the training set output for ML-OCD, as mentioned above.  Yi-hung noted that one issue with the use of TEM data for ML-OCD modeling is that a large amount of TEM sample data would be required as the model output target.  (The fine resolution of the TEM image compared to the field of the incident OCD exposure exacerbates the issue.)

ML for Scanning Electron Microscopy (SEM)
The familiar SEM images measure the intensity of secondary electrons (emitted from the outer atomic electron shell) that are produced from collisions with an incident primary electron – the greater the number of SE’s generated in a local area, the brighter the SEM image.  SEMs are utilized at deep submicron nodes for (top view) line/space images, and in particular, showing areas where lithographic and material patterning process defects are present.

ML methods could be applied to SEM images for defect identification and classification, and to assist with root cause determination by correlating the defects to specific process steps.

Another scanning electron technique uses a variable range of higher-energy primary electrons, which will have different landing distances from the surface, and thus, provide secondary electrons from deeper into the material.  However, an extremely large primary energy will result in the generation of both secondary electrons and X-ray photons, as illustrated below.  (Yi-hung noted that this will limit the image usability for the electron detectors used in SEM equipment, and thus limit the material depth that could be explored – either more SE sensitivity or SE plus X-ray detector resolution will be required.)   The opportunities for a (generative) machine learning network to assist with “deep SEM” image classification are great.

Summary
Yi-hung concluded his presentation with the following breakdown of metrology requirements:

  • (high-throughput) dimensional measurement:
      • OCD, X-ray spectroscopy  (poor on 3D topography)
  • (high-accuracy, destructive) reference measurement:  TEM
  • Inspection (defect identification and yield prediction):  SEM
  • In-line monitoring (high-throughput, non-destructive):
      • hybrid of OCD + X-ray, with ML-OCD in the future?

In all these cases, there are great opportunities to apply machine learning methods to the fundamental metrology requirements of advanced process development and high-volume manufacturing.   Yi-hung repeated the cautionary tone that semiconductor engineering metrology currently does not have the volume of training data associated with other ML applications.  Nevertheless, he encouraged data science engineers potentially interested in these applications to contact him.   🙂

Yi-hung also added that there is a whole other metrology field to explore for potential AI applications – namely, the application of the sensor data captured by individual pieces of semiconductor processing equipment, as it relates to overall manufacturing yield and throughput.  A mighty challenge, indeed.

-chipguy

 

References

[1]  Yi-hung Lin, “Metrology with Angstrom Accuracy Required by Logic IC Manufacturing – Challenges From R&D to High Volume Manufacturing and Solutions in the AI Era”, VLSI 2020 Symposium, Workshop WS2.3.

Images supplied by the VLSI Symposium on Technology & Circuits 2020.

 


Optimizing Chiplet-to-Chiplet Communications

Optimizing Chiplet-to-Chiplet Communications
by Tom Dillinger on 06-29-2020 at 6:00 am

bump dimensions

Summary
The growing significance of ultra-short reach (USR) interfaces on 2.5D packaging technology has led to a variety of electrical definitions and circuit implementations.  TSMC recently presented the approach adopted by their IP development team, for a parallel-bus, clock-forwarded USR interface to optimize power/performance/area – i.e., “LIPINCON”.

Introduction
The recent advances in heterogeneous, multi-die 2.5D packaging technology have resulted in a new class of interfaces – i.e., ultra-short reach (USR) – whose electrical characteristics differ greatly from traditional printed circuit board traces.  Whereas the serial communications lane of SerDes IP is required for long, lossy connections, the short-reach interfaces support a parallel bus architecture.

The SerDes signal requires (50 ohm) termination to minimize reflections and reduce far-end crosstalk, adding to the power dissipation.  The electrically-short interfaces within the 2.5D package do not require termination.  Rather than “recovering” the clock embedded within the serial data stream, with the associated clock-data recovery (CDR) circuit area and power, these parallel interfaces can use a simpler “clock-forwarded” circuit design – a transmitted clock signal is provided with a group of N data signals.

Another advantage of this interface is that the circuit design requirements for electrostatic discharge protection (ESD) between die are much reduced.  Internal package connections will have lower ESD voltage stress constraints, saving considerable I/O circuit area (and significantly reducing I/O parasitics).

The unique interface design requirements between die in a 2.5D package have led to the use of the term “chiplet”, as the full-chip design overhead of SerDes links is not required.  Yet, to date, there have been quite varied circuit and physical implementation approaches used for these USR interfaces.

TSMC’s LIPINCON interface definition
At an invited talk for the recent VLSI 2020 Symposium, TSMC presented their proposal for a parallel-bus, clock-forwarded architecture – “LIPINCON” – which is short for “low-voltage, in-package interconnect”. [1]  This article briefly reviews the highlights of that presentation.

The key parameters of the short-reach interface design are:

  • Data rate per pin:  dependent upon trace length/insertion loss, power dissipation, required circuit timing margins
  • Bus width:  with modularity to define sub-channels
  • Energy efficiency:  measured in pJ/bit, including not only the I/O driver/receiver circuits, but any additional data pre-fetch/queuing and/or encoding/decoding logic
  • “Beachfront” (linear) and area efficiencies:  measure of the aggregate data bandwidth per linear edge and area perimeter on the chiplets – i.e., Tbps/mm and Tbps/mm**2;  dependent upon the signal bump pitch, and the number and pitch of the metal redistribution layers on the 2.5D substrate, which defines the number of bump rows for which signal traces can be routed – see the figures below
  • Latency:  another performance metric; the time between the initiation of data transmit and receive, measured in “unit intervals” of the transmit cycle

Architects are seeking to maximize the aggregate data bandwidth (bus width * data rate), while achieving very low dissipated energy per bit.  These key design measures apply whether the chiplet interface is between multiple processors (or SoCs), processor-to-memory, or processor-to-I/O controller functionality.
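
The relationships between these metrics are simple enough to sketch in code. The values below (bus width, per-pin data rate, edge length, power) are assumptions chosen only to show the arithmetic; they are not LIPINCON specifications.

```python
# Back-of-the-envelope sketch of the interface metrics listed above. All input
# numbers are assumptions for illustration, not LIPINCON specifications.
def aggregate_bandwidth_tbps(bus_width_bits: int, gbps_per_pin: float) -> float:
    return bus_width_bits * gbps_per_pin / 1000.0

def beachfront_efficiency(bus_width_bits: int, gbps_per_pin: float,
                          edge_length_mm: float) -> float:
    """Tbps per mm of die edge used by the interface."""
    return aggregate_bandwidth_tbps(bus_width_bits, gbps_per_pin) / edge_length_mm

def energy_per_bit_pj(power_mw: float, bandwidth_tbps: float) -> float:
    """pJ/bit = mW / Gbps; 1 Tbps = 1000 Gbps."""
    return power_mw / (bandwidth_tbps * 1000.0)

bw = aggregate_bandwidth_tbps(256, 8.0)                  # 256-bit bus @ 8 Gbps/pin (assumed)
print(f"aggregate bandwidth: {bw:.2f} Tbps")
print(f"beachfront efficiency: {beachfront_efficiency(256, 8.0, 2.0):.2f} Tbps/mm")
print(f"energy efficiency: {energy_per_bit_pj(600.0, bw):.2f} pJ/bit")
```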

The physical signal implementation will differ, depending on the packaging technology.  The signal redistribution layers (RDL) for a 2.5D package with silicon interposer will leverage the finer metal pitch available (e.g., TSMC’s CoWoS).  For a multi-die package utilizing the reconstituted wafer substrate to embed the die, the RDL layers are much thicker, with a wider pitch (e.g., TSMC’s InFO).  The figures below illustrate the typical signal trace shielding (and lack of shielding) associated with CoWoS and InFO designs, and the corresponding signal insertion and far-end crosstalk loss.

 

The key characteristics of the TSMC LIPINCON IP definition are illustrated schematically in the figure below.

  • A low signal swing interface of 0.3V is adopted (also saves power).
  • The data receiver uses a simple differential circuit, with a reference input to set the switching threshold (e.g., 150mV).
  • A clock/strobe signal is forwarded with (a sub-channel of) data signals;  the receiver utilizes a simple delay-locked loop (DLL) to “lock” to this clock.

Briefly, a DLL is a unique circuit – it consists of an (even-numbered) chain of identical delay cells.  The figure below illustrates an example of the delay chain. [2]   The switching delay of each stage is dynamically adjusted by modulating the voltage inputs to the series nFET and pFET devices in the input inverter of each stage – i.e., a “current-starved” inverter.  (Other delay chain implementations dynamically modify the identical capacitive load at each stage output, rather than adjusting the internal transistor drive strength of each stage.)

The “loop” in the DLL is formed by a phase detector (XOR-type logic with low-pass filter), which compares the input clock to the final output of the chain.  The leading or lagging nature of the input clock relative to the chain output adjusts the inverter control voltages – thus, the overall delay of the chain is “locked” to the input clock.  The (equal) delays of each stage in the DLL chain provides outputs that correspond to a specific phase of the input clock signal.  The parallel data is captured in receiver flops using an appropriate phase output, a means of compensating for any data-to-clock skew across the interface.
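
A DLL of this kind can be modeled behaviorally as a bang-bang control loop: the phase detector nudges the per-stage delay up or down until the total chain delay equals one clock period. The sketch below is that simplified abstraction, with arbitrary numbers, rather than a circuit-level model of a current-starved inverter chain.

```python
# Behavioral sketch of a bang-bang delay-locked loop: an even-length chain of
# identical delay stages whose per-stage delay is nudged until the total chain
# delay locks to one input clock period. All numbers are illustrative.
N_STAGES = 8
CLOCK_PERIOD_PS = 1000.0       # forwarded clock period (assumed)
stage_delay_ps = 80.0          # initial per-stage delay
STEP_PS = 0.5                  # control-loop adjustment per comparison

for _ in range(2000):
    chain_delay = N_STAGES * stage_delay_ps
    # Phase detector: is the chain output late or early vs. the next clock edge?
    if chain_delay > CLOCK_PERIOD_PS:
        stage_delay_ps -= STEP_PS      # speed the delay stages up
    else:
        stage_delay_ps += STEP_PS      # slow them down

# Once locked, each stage output is one equally spaced phase of the input clock.
phases_deg = [360.0 * k / N_STAGES for k in range(N_STAGES)]
print(f"locked stage delay: {stage_delay_ps:.1f} ps, phases: {phases_deg}")
```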

The TSMC IP team developed an innovative approach for the specific case of a SoC-to-memory interface.  The memory chiplet may not necessarily embed a DLL to capture signal inputs.  For a very wide interface – e.g., 512 addresses, 256 data bits, divided into sub-channels – the overhead of the DLL circuitry in the cost-sensitive memory chiplet would be high.  As illustrated in the figure below, the DLL phase output which serves as the input strobe for a memory write cycle is present in the SoC instead.  (The memory read path is also illustrated in the figure, illustrating how the data strobe from the memory is connected to the read_DLL circuit input.)

For the parallel LIPINCON interface, simultaneous switch noise (SSN) related to signal crosstalk is a concern.  For the shielded (CoWoS) and unshielded (InFO) RDL signal connections illustrated above, TSMC presented results illustrating very manageable crosstalk for this low-swing signaling.

To be sure, designers would have the option of developing a logical interface between chiplets that used data encoding to minimize signal transition activity in successive cycles.  The simplest method would be to add data bus inversion (DBI) coding – the data in the next cycle could be compared to the current data, and transmitted using true or inverted values to minimize the switching activity.  An additional DBI signal between chiplets carries this decision for the receiver to decode the values.
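
A minimal implementation of that DBI scheme is easy to sketch: compare the candidate word against the previous bus state, and transmit it inverted (with the DBI flag set) whenever that reduces the number of toggling wires. The code below is an illustrative 8-bit version, not taken from any specific interface specification.

```python
# Sketch of simple data bus inversion (DBI) coding as described above: send the
# next word inverted if that reduces the number of bits toggling on the bus.
def dbi_encode(prev_bus: int, data: int, width: int = 8):
    mask = (1 << width) - 1
    toggles_true = bin((prev_bus ^ data) & mask).count("1")
    toggles_inv = bin((prev_bus ^ (~data & mask)) & mask).count("1")
    if toggles_inv < toggles_true:
        return (~data & mask), 1          # transmit inverted data, DBI flag = 1
    return data & mask, 0                 # transmit true data, DBI flag = 0

def dbi_decode(bus: int, dbi: int, width: int = 8) -> int:
    mask = (1 << width) - 1
    return (~bus & mask) if dbi else bus

prev = 0b00000000
word = 0b11110111                         # would toggle 7 of 8 wires if sent as-is
bus, flag = dbi_encode(prev, word)
assert dbi_decode(bus, flag) == word
print(f"sent {bus:08b} with DBI={flag}")  # inverted word toggles only 1 wire
```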

The development of heterogeneous 2.5D packaging relies upon the integration of known good die/chiplets (KGD).  Nevertheless, the post-assembly yield of the final package can be enhanced by the addition of redundant lanes which can be selected after package test (ideally, built-in self-test).  The TSMC presentation included examples of redundant lane topologies which could be incorporated into the chiplet designs.  The figure below illustrates a couple of architectures for inserting redundant through-silicon-vias (TSVs) into the interconnections.  This would be a package yield versus circuit overhead tradeoff when architecting the interface between chiplets.

In a SerDes-based design, thorough circuit and PCB interconnect extraction plus simulation is used to analyze the signal losses.  The variations in signal jitter and magnitude are analyzed against the receiver sense amp voltage differential.  Hardware lab-based probing is also undertaken to ensure a suitable “eye opening” for data capture at the receiver.  TSMC highlighted that this type of interface validation is not feasible with the 2.5D package technology.  As illustrated below, a novel method was developed by their IP team to introduce variation into the LIPINCON transmit driver and receive capture circuitry to create an equivalent eye diagram for hardware validation.

The TSMC presentation mentioned that some of their customers have developed their own IP implementations for USR interface design.  One example showed a very low swing (0.2V) electrical definition that is “ground referenced” (e.g., signal swings above and below ground).  Yet, for fabless customers seeking to leverage advanced packaging, without the design resources to “roll their own” chiplet interface circuitry, the TSMC LIPINCON IP definition is an extremely attractive alternative.  And, frankly, given the momentum that TSMC is able to provide, this definition will likely help accelerate a “standard” electrical definition among developers seeking to capture IP and chiplet design market opportunities.

For more information on TSMC’s LIPINCON definition, please follow this link.

-chipguy

 

References

[1]  Hsieh, Kenny C.H., “Chiplet-to-Chiplet Communication Circuits for 2.5D/3D Integration Technologies”,  VLSI 2020 Symposium, Paper SC2.6 (invited short course).

[2]  Jovanovic, G., et al., “Delay Locked Loop with Linear Delay Element”, International Conference on Telecommunication, 2005, https://ieeexplore.ieee.org/document/1572136

Images supplied by the VLSI Symposium on Technology & Circuits 2020.

 


Multi-Vt Device Offerings for Advanced Process Nodes

Multi-Vt Device Offerings for Advanced Process Nodes
by Tom Dillinger on 06-26-2020 at 6:00 am

Ion Ioff

Summary
As a result of extensive focus on the development of workfunction metal (WFM) deposition, lithography, and removal, both FinFET and gate-all-around (GAA) devices will offer a wide range of Vt levels for advanced process nodes below 7nm.

Introduction
Cell library and IP designers rely on the availability of nFET and pFET devices with a range of threshold voltages (Vt).  Optimization algorithms used in physical synthesis flows evaluate the power, performance, and area (PPA) of both cell “drive strength” (e.g., 1X, 2X, 4X-sized devices) and cell “Vt levels” (e.g., HVT, SVT, LVT) when selecting a specific instance to address timing, noise, and power constraints.  For example, a typical power optimization decision is to replace a cell instance with a higher Vt variant to reduce leakage power, if the timing path analysis margins allow (after detailed physical implementation).  The additional design constraints for multi-Vt cell library use are easily managed:  (1) the device Vt active area must meet (minimum) lithography area requirements, and (2) the percentage of low Vt cells used should be small, to keep leakage currents in check.

A common representation to illustrate the device Vt offerings in a particular process is to provide an I_on versus I_off characterization curve, as shown in the figure below.

Although it doesn’t reflect the process interconnect scaling options, this curve is also commonly used as a means of comparing different processes, as depicted in the figure.  A horizontal line shows the unloaded, I_on based performance gains achievable.  The vertical line illustrates the iso-performance leakage I_off power reduction between processes, for a reference-sized device in each.  Note that these lines are typically drawn without aligning to specific (nominal) Vt devices in the two process nodes.

The I_on versus I_off curve does not really represent the statistical variation in the process device Vt values.  A common model for representing this data is the Pelgrom equation. [1]  The standard deviation of (measured) device Vt data is plotted against (1 / sqrt(Weff * Lgate)):

(sigma_Vt)**2  =  A**2 / (2 * Weff * Lgate)

       where A is a “fitting” constant for the process

Essentially, as the square root of the channel area of the device increases, sigma-Vt decreases.  (Consider N devices in parallel with independent Vt variation – the Vt mean of the total will be the mean of the Vt distribution, while the effective standard deviation is reduced.)  The Pelgrom plot for the technology is an indication of the achievable statistical process control – more on Vt variation shortly.
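
The parallel-device argument can be checked numerically with the Pelgrom relation itself, as in the sketch below; the fitting constant A and the device dimensions are arbitrary assumptions used only to show the scaling.

```python
import math

# Numeric illustration of sigma_Vt^2 = A^2 / (2 * Weff * Lgate): quadrupling the
# channel area halves sigma_Vt, so N identical devices in parallel behave like
# one device with N times the area. "A" is an arbitrary assumed fitting constant.
A = 2.0e-3          # V*um, assumed; not a value for any particular process

def sigma_vt(weff_um: float, lgate_um: float) -> float:
    return A / math.sqrt(2.0 * weff_um * lgate_um)

base = sigma_vt(0.10, 0.02)
quad = sigma_vt(0.40, 0.02)               # 4x the width -> half the sigma
print(f"sigma_Vt (1x area): {base * 1000:.2f} mV")
print(f"sigma_Vt (4x area): {quad * 1000:.2f} mV")
assert abs(quad - base / 2.0) < 1e-12
```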

For planar CMOS technologies, Vt variants from the baseline device were fabricated using a (low impurity dose) implant into the channel region.  A rather straightforward Vt implant mask lithography step was used to open areas in the mask photoresist for the implant.  For an implant equivalent to the background substrate/well impurity type, the device Vt would be increased.  The introduction of an implant step modifying the background concentration would increase the Vt variation, as well.

With the introduction of FinFET channel devices, the precision and control of implant-based Vt adjusts became extremely difficult.  The alternative pursued for these advanced (high-K gate oxide, metal gate) process nodes is to utilize various gate materials, each with a different metal-to-oxide workfunction contact potential.

Vt offerings for advanced nodes
As device scaling continues, workfunction metal (WFM) engineering for Vt variants faces multiple challenges.  A presentation at the recent VLSI 2020 Symposium by TSMC elaborated upon these challenges, and highlighted a significant process enhancement to extend multi-Vt options for nodes below 7nm. [2]

The two principal factors that exacerbate the fabrication of device Vt’s at these nodes are shown in the figures below, from the TSMC presentation.

  • The scaling of the device gate length (shown in cross-section in the figure) requires that the WFM deposition into the trench be conformal in thickness, and be thoroughly removed from unwanted areas.
  • Overall process scaling requires aggressive reduction in the nFET to pFET active area spacing.  Lithographic misalignment and/or non-optimum WFM patterning may result in poor device characteristics – the figure above illustrates incomplete WFM coverage of the (fin and/or GAA) device.

Parenthetically, another concern with the transition to GAA device fabrication is the requirement to provide a conformal WFM layer on all sides of each (horizontal) nanosheet, without “closing off” the gap between sheets.

The TSMC presentation emphasized the diverse requirements of the HPC, AI, 5G comm., and mobile markets, which have different top priorities among the PPA tradeoffs.  As a result, despite the scaling challenges listed above, the demand for multi-Vt cell libraries and PPA optimization approaches remains strong.  TSMC presented extremely compelling results of their WFM fabrication engineering focus.  The figure below illustrates that TSMC has demonstrated a range of Vt offerings for sub-7nm nodes that is wider than at 7nm.  TSMC announced an overall target Vt range exceeding 250mV.  (Wow.)

In addition to the multi-Vt data, TSMC provided corresponding analysis results for the Vt variation (Pelgrom plot) and the time-dependent device breakdown (TDDB) reliability data – see the figures below.

The sigma-Vt Pelgrom coefficient is improved with the new WFM processing, approaching the 7nm node results.  The TDDB lifetime is also improved over the original WFM steps.

The markets driving the relentless progression to advanced process nodes have disparate performance, power, and area goals.  The utilization of multi-Vt device and cell library options has become an integral design implementation approach.  The innovative process development work at TSMC continues this design enablement feature, even extending this capability over the 7nm node – that’s pretty amazing.

For more information on TSMC’s advanced process nodes, please follow this link.

-chipguy

References
[1]  M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, “Matching properties of MOS transistors”, IEEE J. Solid-State Circuits, vol. 24, no. 5, pp. 1433–1440, Oct. 1989.

[2]  Chang, Vincent S., et al., “Enabling Multiple-Vt Device Scaling for CMOS Technology beyond 7nm Node”, VLSI Symposium 2020, Paper TC1.1.

Images supplied by the VLSI Symposium on Technology & Circuits 2020.

 


Effect of Design on Transistor Density

Effect of Design on Transistor Density
by Scotten Jones on 05-26-2020 at 10:00 am

TSMC N7 Density Analysis SemiWiki

I have written a lot of articles looking at leading edge processes and comparing process density. One comment I often get is that the process density numbers I present do not correlate with the actual transistor density on released products. A lot of people want to draw conclusions about Intel’s processes versus TSMC’s processes based on Apple cell phone application processors versus Intel microprocessors; this is not a valid comparison! In this article I will review the metrics I use for transistor density, why I use them, and why comparing transistor density on product designs is not valid.

The first comment I want to make is that I am not a circuit designer, and therefore I am not familiar with all of the design decisions that may impact the transistor density of the final product, but I do understand the differences in density that can occur across a given process.

Logic designs are made up of standard cells, and the size of the standard cells is driven by four parameters: metal two pitch (M2P), track height (TH), contacted poly pitch (CPP), and single diffusion break (SDB) versus double diffusion break (DDB).

Cell Height
The height of a standard cell is the metal two pitch (M2P) multiplied by the number of tracks (the track height, or TH). In recent years, in order to continue shrinking standard cells, TH has been reduced while simultaneously reducing M2P as part of design technology co-optimization (DTCO). One key aspect of reducing TH is that the number of fins per transistor must be reduced at low track heights due to space constraints; this is called fin depopulation. Fewer fins per transistor means less drive current from each transistor unless something else compensates for it, such as increasing fin height – hence DTCO.
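
As a simple illustration of the height arithmetic, the sketch below multiplies M2P by the track count; the pitch and track values are hypothetical examples, not figures for any specific foundry process.

```python
# Hedged sketch of the cell-height arithmetic: cell height = M2P x track count.
# The M2P and track-height values below are hypothetical, for illustration only.
def cell_height_nm(m2p_nm: float, tracks: float) -> float:
    return m2p_nm * tracks

for m2p, tracks in [(40, 7.5), (40, 6.0), (36, 6.0)]:
    print(f"M2P = {m2p} nm, {tracks}-track cell -> height = {cell_height_nm(m2p, tracks):.0f} nm")

# Reducing either M2P or the track count shrinks the cell, but fewer tracks
# generally forces fin depopulation, which is why DTCO pairs the two changes.
```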

Cell Width
The width of a standard cell depends on the contacted poly pitch (CPP), whether the process supports single diffusion break (SDB) or double diffusion break (DDB), and the type of cell. For example, a NAND gate is 3 CPPs wide with an SDB and 4 CPPs wide with a DDB. On the other hand, a scanned flip flop (SFF) cell might be something like 19 CPPs wide with an SDB and 20 CPPs wide with a DDB (this can vary with SFF designs). As you can see, SDB versus DDB has a much bigger effect on a NAND cell's size than on an SFF cell's.
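
The sketch below works through the width arithmetic using the CPP counts quoted above (NAND: 3 CPPs with SDB, 4 with DDB; SFF: roughly 19 versus 20 CPPs). The CPP value itself is a hypothetical example, not a specific process number.

```python
# Hedged sketch of the cell-width arithmetic: cell width = CPP x number of CPPs.
# The CPP value is hypothetical; the CPP counts are those quoted in the text.
CPP_NM = 57.0  # hypothetical contacted poly pitch

def cell_width_nm(n_cpp: int, cpp_nm: float = CPP_NM) -> float:
    return n_cpp * cpp_nm

for cell, sdb_cpps, ddb_cpps in [("NAND", 3, 4), ("SFF", 19, 20)]:
    sdb, ddb = cell_width_nm(sdb_cpps), cell_width_nm(ddb_cpps)
    print(f"{cell}: SDB {sdb:.0f} nm vs. DDB {ddb:.0f} nm "
          f"-> SDB saves {(ddb - sdb) / ddb:.0%} of the cell width")

# SDB removes one CPP in either case, so it saves 25% of a NAND cell's width
# but only 5% of an SFF cell's, which is the point made above.
```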

Cell Options
When discussing process density, I always compare the minimum cell size, but processes offer multiple options. For example, TSMC's 7nm 7FF process offers a minimum 6-track cell with 2 fins per transistor as well as a 9-track cell with 3 fins per transistor. The 9-track cell offers 1.5x the drive current of the 6-track cell but is also 1.5x the size. This illustrates one of the problems with comparing two product designs as a way of characterizing transistor density: a high-performance design would use more 9-track cells and therefore have lower transistor density than a design targeted at minimum size or lower power built with 6-track cells on the same process. Even the proportion of NAND cells versus SFF cells affects the transistor density.

Figure 1 summarizes the density difference between 6-track and 9-track cells on the TSMC 7FF process. Please note that the MTx/mm2 parameter is millions of transistors per square millimeter, based on a mix of 60% NAND cells and 40% SFF cells.

Figure 1. TSMC 7FF Density Analysis

An interesting observation from Figure 1 is that a minimum-area SFF cell has over 2x the transistor density of a high-performance NAND cell on the same process. There are also many other types of standard cells with varying transistor densities.
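
To make the metric behind Figure 1 concrete, here is a hedged sketch of the 60% NAND / 40% SFF weighting. The transistor counts and cell areas are illustrative assumptions (a 2-input NAND has 4 transistors; the SFF count and all areas are made up), not the actual Figure 1 inputs.

```python
# Hedged sketch of the weighted transistor-density metric (60% NAND, 40% SFF).
# Transistor counts and cell areas below are illustrative assumptions only.
def density_mtx_per_mm2(cells):
    """cells: list of (weight, transistors, area_um2). 1 Tx/um^2 == 1 MTx/mm^2."""
    transistors = sum(w * t for w, t, _ in cells)
    area = sum(w * a for w, _, a in cells)
    return transistors / area

# Hypothetical 6-track cells: (mix weight, transistor count, cell area in um^2)
six_track = [(0.6, 4, 0.040), (0.4, 28, 0.280)]
# The 9-track cells hold the same transistors in ~1.5x the area
nine_track = [(w, t, 1.5 * a) for w, t, a in six_track]

print(f"6-track mix: {density_mtx_per_mm2(six_track):.0f} MTx/mm2")
print(f"9-track mix: {density_mtx_per_mm2(nine_track):.0f} MTx/mm2")

# The high-performance (9-track) mix comes out 1/1.5 as dense as the 6-track
# mix, which is why the cell choices in a product shift its apparent transistor
# density even on the same process.
```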

Memory Array
Most system on a chip (SOC) circuits contain significant SRAM memory arrays; in fact, it is not unusual for over half the die area to be SRAM array.

The 7FF process offers a high-density 6-transistor (6T) SRAM cell that is 0.0270 square microns in area, which works out to 222 MTx/mm2. In theory, a lot of memory array area on a design could result in higher transistor density; however, as with many things related to comparing process density, it isn't that simple.

While doing a project for a customer, I analyzed 3 TSMC SRAM test chips and the embedded SRAM arrays in 4 Intel chips and 1 AMD chip. The SRAM arrays were on average 2.93x the size you would expect based on the SRAM cell size for the process and the bit capacity of the array, presumably due to the interconnect and circuitry needed to access the memory. If we base the SRAM transistor density on only the SRAM cells in the array, the density drops to 75.84 MTx/mm2, although there are certainly some transistors in the access circuitry that this isn't counting.
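
The sketch below simply reproduces the SRAM arithmetic above: 6 transistors in a 0.0270 um2 bitcell give about 222 MTx/mm2, and dividing by the measured 2.93x array overhead gives the 75.84 MTx/mm2 effective figure.

```python
# Reproducing the SRAM arithmetic quoted above (figures taken from the text).
BITCELL_AREA_UM2 = 0.0270   # high-density 6T cell area
TRANSISTORS_PER_CELL = 6
ARRAY_OVERHEAD = 2.93       # measured array area / (bit count x bitcell area)

raw_density = TRANSISTORS_PER_CELL / BITCELL_AREA_UM2        # Tx/um^2 == MTx/mm^2
effective_density = raw_density / ARRAY_OVERHEAD

print(f"Raw 6T bitcell density:  {raw_density:.0f} MTx/mm2")         # ~222
print(f"Effective array density: {effective_density:.2f} MTx/mm2")   # ~75.84

# The access and interconnect transistors around the array are not counted
# here, so the effective figure slightly understates the true transistor count.
```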

Other Circuits
Certain SOC designs may also include analog, I/O and other elements that have significantly lower transistor density than minimum cells.

Conclusion
The bottom line to all of this is that if you could implement the same design, say an ARM core with the same amount of SRAM, in different processes, then you could use actual designs to compare process density. Since that isn't available, some type of representative metric that can be applied consistently is needed. When I compare processes, I compare the transistor density of a minimum-size logic cell using a 60% NAND cell / 40% SFF cell mix. This is not a perfect metric, but it compares processes under the same conditions. I also want to mention that for processes in production, my calculations are based on dimensions measured on actual product, typically by TechInsights, and not on information from the individual companies I am covering. I do use information from company announcements when estimating future process density.

Also Read:

Cost Analysis of the Proposed TSMC US Fab

Can TSMC Maintain Their Process Technology Lead

SPIE 2020 – ASML EUV and Inspection Update


Cost Analysis of the Proposed TSMC US Fab
by Scotten Jones on 05-19-2020 at 10:00 am

TSMC US Fab SemiWiki

On May 15th TSMC “announced its intention to build and operate an advanced semiconductor fab in the United States with the mutual understanding and commitment to support from the U.S. federal government and the State of Arizona.”

The fab will run TSMC's 5nm technology and have a capacity of 20,000 wafers per month (wpm). Construction is planned to start in 2021 and production is targeted for 2024. Total spending on the project, including capital expenditure, will be $12 billion between 2021 and 2029.

This announcement is undoubtedly the result of intense pressure on TSMC from the US government, and it also emerged today that TSMC will stop taking orders from Huawei, likewise under pressure from the US.

What does this fab announcement mean?

This announcement is, in my opinion, soft: “intention to build”, “construction planned to start”, “production targeted”. The project is based on a “mutual understanding and commitment to support from the U.S. federal government and the State of Arizona”. What happens if Donald Trump is voted out in November or simply changes his mind? I could easily see this project never materializing due to changes in the US political situation or a lack of follow-through from TSMC, which is likely not excited about it to begin with.

My company, IC Knowledge LLC, is the world leader in cost and price modeling of semiconductors and MEMS. I thought it would be interesting to use our Strategic Cost and Price Model to make some calculations around this fab.

TSMC operates four major 300mm manufacturing sites in Taiwan and one in China. The four sites in Taiwan are all GigaFab sites; Fab 12, Fab 14, Fab 15 and Fab 18 are each made up of 6 or 7 wafer fabs sharing central facility plants. This GigaFab approach is believed to reduce construction costs by about 25% versus building a single stand-alone fab. The China location is smaller, with 2 fabs at one site, and because it runs trailing-edge processes it was equipped with used equipment transferred from fabs in Taiwan. If TSMC really builds a single US fab running 20,000 wpm, the resulting cost to produce a wafer will be roughly 1.3% higher than at a GigaFab location due to higher construction costs. I believe it is unlikely the site will be equipped with used equipment transferred from Taiwan. The cost to build and equip the fab for 20,000 wpm should be approximately $5.4 billion.

Locating a fab in the US rather than Taiwan means it will incur US labor and utility costs, adding approximately 3.4% to the wafer manufacturing cost.

The capacity of the fab is also smaller than a “typical” fab at advanced nodes; the three 5nm fabs TSMC is operating or planning in Taiwan are all 30,000 wpm. A 20,000 wpm fab will have approximately 3.8% higher costs than a 30,000 wpm fab under the same conditions.

In total, wafers produced at the TSMC Arizona fab will be approximately 7% more expensive to manufacture than wafers made in Fab 18 in Taiwan. This does not account for the impact of taxes, which are likely to be higher in the US than in Taiwan.
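
For readers who want to play with the numbers, the sketch below naively rolls up the three cost adders quoted above. It is only a rough cross-check, not IC Knowledge's Strategic Cost and Price Model, and the fact that a naive roll-up lands above the modeled ~7% is a reminder that the published adders cannot simply be stacked.

```python
# Naive roll-up of the individual cost adders quoted in the article.
# This is NOT the Strategic Cost and Price Model; it is only a cross-check.
adders = {
    "single fab vs. GigaFab construction": 0.013,
    "US labor and utility costs":          0.034,
    "20,000 wpm vs. 30,000 wpm scale":     0.038,
}

additive = sum(adders.values())                    # simple sum of the adders
compounded = 1.0
for delta in adders.values():
    compounded *= 1.0 + delta
compounded -= 1.0                                  # multiplicative combination

print(f"Naive additive total:   {additive:.1%}")   # ~8.5%
print(f"Naive compounded total: {compounded:.1%}") # ~8.7%
# Both exceed the ~7% total from the cost model; the adders evidently interact,
# so the model's combined result is the figure that matters.
```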

In the announcement TSMC said total spending on the project between 2021 and 2029 would be $12 billion. After the roughly $5.4 billion to build and equip the initial fab, that leaves money for future expansion or a conversion to 3nm; as one possible example, it would be almost enough to add a second 20,000 wpm fab running 3nm.

In summary, the “announced” fab would likely be TSMC's highest-cost production site. It will be interesting to see if it materializes.

Also Read:

Can TSMC Maintain Their Process Technology Lead

SPIE 2020 – ASML EUV and Inspection Update

SPIE 2020 – Applied Materials Material-Enabled Patterning


TSMC’s Advanced IC Packaging Solutions
by Herb Reiter on 05-01-2020 at 10:00 am

Fig 3 TSMC Adv Pkg blog

TSMC as Pure Play Wafer Foundry
TSMC started its wafer foundry business more than 30 years ago. Visionary management and creative engineering teams developed leading-edge process technologies and earned a reputation as a trusted source for high-volume production. TSMC also recognized very early the importance of building an ecosystem to complement the company's own strengths. Their Open Innovation Platform (OIP) attracted many EDA and IP partners to contribute to TSMC's success, all following Moore's Law, now down to 3 nm, to serve very high-volume applications.

Markets need Advanced IC Packaging technologies
For many other applications Moore's Law is no longer cost-effective, especially not for the integration of heterogeneous functions. "More than Moore" technologies, like multi-chip modules (MCMs) and System in Package (SiP), have become alternatives for integrating large amounts of logic and memory, analog, MEMS, etc. into (sub)system solutions. However, these methodologies were, and still are, very customer specific and incur significant development time and cost.

In response to market needs for new multi-die IC packaging solutions, TSMC has developed, in cooperation with OIP partners, advanced IC packaging technologies to offer economical solutions for More than Moore integration.

TSMC as supplier of Advanced IC Packaging solutions
In 2012 TSMC introduced, together with Xilinx, by far the largest FPGA available at that time, comprising four identical 28 nm FPGA slices mounted side-by-side on a silicon interposer. They also developed through-silicon vias (TSVs), micro-bumps and re-distribution layers (RDLs) to interconnect these building blocks. Based on its construction, TSMC named this IC packaging solution Chip-on-Wafer-on-Substrate (CoWoS). This building-block-based and EDA-supported packaging technology has become the de facto industry standard for high-performance and high-power designs. Interposers up to three stepper fields in size allow combining multiple die, die stacks and passives, side by side, interconnected with sub-micron RDLs. The most common applications today combine a CPU/GPU/TPU with one or more high-bandwidth memories (HBMs).

In 2017 TSMC announced the Integrated Fan-Out (InFO) technology. Instead of the silicon interposer used in CoWoS, it uses a polyimide film, reducing unit cost and package height, both important success criteria for mobile applications. TSMC has already shipped tens of millions of InFO designs for use in smartphones.

In 2019 TSMC introduced the System on Integrated Chip (SoIC) technology. Using front-end (wafer-fab) equipment, TSMC can very accurately align and then compression-bond designs with many finely pitched copper pads, further minimizing form factor, interconnect capacitance and power.

Figure 1 shows that CoWoS technology is targeting Cloud, AI, Networking, Datacenters and other high-performance and high-power computing applications.

InFO serves some of these and a broad range of other, typically more cost-sensitive and lower power markets.

SoIC technology offers multi-die building blocks for integration in CoWoS and/or InFO designs – see Figure 2.

SoIC technology benefits
TSMC's latest innovation, the SoIC technology, is a very powerful way to stack multiple dice into a “3D building block” (a.k.a. a “3D chiplet”). Today SoIC enables about 10,000 interconnects per mm2 between vertically stacked dice, and development efforts towards 1 million interconnects per mm2 are ongoing. 3D-IC enthusiasts, including myself, have been looking for an IC packaging methodology that enables such fine-grain interconnects, further reduces form factor, eliminates bandwidth limitations, simplifies heat management in die stacks, and makes integrating large, highly parallel systems into an IC package practical. As its name – System on IC – suggests, this technology meets these challenging requirements. The impressive capabilities of SoIC and SoIC+ are further explained here. TSMC's EDA partners are working on complementing this technology with user-friendly design methodologies. I expect IP partners to soon offer SoIC-ready chiplets and simulation models for user-friendly integration into CoWoS and InFO designs.
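
As a rough sense of scale, the sketch below converts interconnect density to an equivalent bond-pad pitch, assuming a simple square grid; this is plain geometry, not a statement about TSMC's actual SoIC pad layout.

```python
import math

# Equivalent pad pitch for a given interconnect density, assuming a square grid.
# Illustrative geometry only, not TSMC's actual SoIC bond-pad layout.
def pitch_um(interconnects_per_mm2: float) -> float:
    return 1000.0 / math.sqrt(interconnects_per_mm2)  # mm -> um

for density in (10_000, 1_000_000):
    print(f"{density:>9,} interconnects/mm2 -> ~{pitch_um(density):.0f} um pad pitch")

# ~10 um pitch today vs. ~1 um for the 1M/mm2 goal: a 100x density increase
# corresponds to a 10x reduction in pad pitch.
```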

Personal comment: More than 20 years ago, in my alliance management role at Synopsys, I had the opportunity to contribute to Dr. Cliff Hou’s pioneering development work on TSMC’s initial process design kits (PDKs) and reference design flows, to facilitate the transition from the traditional IDM to the much more economical fabless IC vendor business model.

With the packaging technologies described above, TSMC is pioneering another change to the semiconductor business. CoWoS, InFO and especially SoIC enable semiconductor and system vendors to migrate from today's lower-complexity (and lower-value) component-level ICs to very high-complexity, high-value system-level solutions in IC packages. Last, but not least, these three advanced IC packaging solutions are accelerating an important industry trend: a big portion of IC and system value creation is shifting from the die to the package.