Nvidia Blackwell Design Flaw Theories Debunked

Daniel Nenni · Sep 7, 2024

This is from Mark Hibben, whom I have read for years and have grown to trust and respect. In my opinion, the other so-called analysts who dumped on TSMC and Nvidia for the delay are clickbaiters and not to be trusted.

My take on the Blackwell delay
On August 2, The Information released a report that Nvidia’s Blackwell would be delayed by at least a quarter due to unspecified “design flaws.” Normally, when one hears about the design flaws of a chip, logic design problems or bugs come to mind. But Nvidia management made clear that this was not the case during their fiscal Q2 results conference call.

Instead, there was an issue with a mask that impacted chip yield, said CFO Colette Kress:

Hopper demand is strong and Blackwell is widely sampling. We executed a change to the Blackwell GPU mask to improve production yields. Blackwell production ramp is scheduled to begin in the fourth quarter and continue into fiscal year '26.

Presumably, the chip yield affects the number of good chips resulting from each silicon wafer fabricated by foundry partner TSMC (TSM). CEO Jensen Huang reiterated that there was nothing wrong with the functional design of the chip. The mask change did not change the functional logic of the chip.
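To make the link between yield and good chips per wafer concrete, here is a rough sketch using the textbook Poisson yield model. Everything numeric in it is an assumption for illustration: the 800 mm² die area is a hypothetical near-reticle-size value, and the defect densities are made up, not TSMC or Nvidia data. The Poisson model also describes random defects, whereas a mask issue is a systematic effect, so this only shows why a yield change matters, not how the Blackwell problem behaves.

```python
import math


def gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Common approximation for the number of whole dies on a round wafer."""
    radius = wafer_diameter_mm / 2
    by_area = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(by_area - edge_loss)


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies with zero killer defects under a Poisson model."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def good_dies_per_wafer(wafer_diameter_mm, die_area_mm2, defects_per_cm2):
    """Expected good dies = gross dies times the yield fraction."""
    gross = gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2)
    return gross * poisson_yield(die_area_mm2, defects_per_cm2)


if __name__ == "__main__":
    # Hypothetical inputs: 300 mm wafer, 800 mm^2 die, two assumed defect densities.
    for d0 in (0.10, 0.05):
        good = good_dies_per_wafer(300, 800, d0)
        print(f"D0={d0:.2f}/cm^2 -> about {good:.1f} good dies per wafer")
```

With a die this large, even a modest change in the effective defect rate moves the good-die count substantially, which is why a yield-driven mask change can be worth a schedule slip.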

Developing the mask to implement a given layer of circuitry on an advanced chip has become an enormously complex process. This is because at the extreme ultraviolet wavelength of 13.5 nm, diffraction effects and optical distortion mean that the light pattern produced on the chip doesn’t look like the mask.

Optical physicists try to work backwards from the desired pattern to predict what the mask should look like, depending on the EUV machine and other factors. The process is computationally intensive, and TSMC has invested in an Nvidia supercomputer to perform what is called “computational lithography.”

The process isn’t perfect, and the actual pattern produced by a mask may not match the computational lithography prediction. Since the issue is one of yield, this is probably what happened in the case of Blackwell, and why it’s going to take several months to find a more optimal mask solution and begin mass production.
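As a toy illustration of why the printed pattern does not look like the mask, the sketch below blurs a one-dimensional mask with a Gaussian kernel and applies a resist threshold. This is a deliberately crude stand-in: real EUV imaging and computational lithography use far more detailed optical and resist models, and the kernel width, threshold, and feature sizes here are all assumed values chosen only to show the effect.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights, standing in for optical blur."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def aerial_image(mask, kernel):
    """Convolve the mask with the kernel (zero outside the mask bounds)."""
    radius = len(kernel) // 2
    image = []
    for i in range(len(mask)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(mask):
                acc += w * mask[j]
        image.append(acc)
    return image


def printed_pattern(image, threshold=0.5):
    """Simple resist model: a pixel prints if intensity reaches the threshold."""
    return [1 if v >= threshold else 0 for v in image]


if __name__ == "__main__":
    # Hypothetical mask: one wide line (10 px) and one narrow line (2 px).
    mask = [0] * 40
    mask[5:15] = [1] * 10
    mask[25:27] = [1] * 2
    printed = printed_pattern(aerial_image(mask, gaussian_kernel(2.0, 6)))
    print("mask   :", "".join(map(str, mask)))
    print("printed:", "".join(map(str, printed)))
```

In this toy setup the wide line survives while the narrow one never reaches the threshold, so the mask has to be altered to get the intended pattern on the wafer. Working out that alteration is the inverse problem described above.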

Asa Fitch’s article in the WSJ reported on the Blackwell delay:

Nvidia hasn’t detailed the nature of the issue. But analysts and industry executives say its engineering challenges stem mostly from the size of the Blackwell chips, which require a significant departure in design.

I disagree with this interpretation. Nvidia has been making its flagship GPU accelerators (such as Hopper) at TSMC’s reticle limit for years. The reticle limit sets the maximum physical size of a chip that TSMC can produce using EUV lithography machines produced by ASML Holding (ASML).

Blackwell does consist of two such chips, but the process to make each is essentially unchanged from Hopper. Blackwell is fabbed using basically the same TSMC N4 process as Hopper. Fitch continues:

Instead of one big piece of silicon, Blackwell consists of two advanced new Nvidia processors and numerous memory components joined in a single, delicate mesh of silicon, metal and plastic.
The manufacturing of each chip has to be close to perfect: serious defects in any one part can spell disaster, and with more components involved, there is a greater chance of that happening.

Once again, I have a different interpretation. The fact that the package consists of two chips doesn’t make the chips harder to make. Each chip already needs to be “close to perfect.” The added complexity is in packaging, not in making the silicon. And it certainly doesn’t have anything to do with the mask issue.

As I pointed out in my investing group article on the Blackwell debut back in March, the approach had already been pioneered by Apple (AAPL) and TSMC in their M-series Ultra processors.

https://seekingalpha.com/article/4719579-nvidia-the-blackwell-delay-and-its-consequences?

KevinK · Sep 7, 2024

Daniel Nenni said:
As I pointed out in my investing group article on the Blackwell debut back in March, the approach had already been pioneered by Apple (AAPL) and TSMC in their M-series Ultra processors.

Mark's generally pretty good, but I think there are some differences between the InFO_LSI packaging Apple uses for the M-Series Ultra and the CoWoS®-L used in Blackwell. If nothing else, the substrate size and the number and scale of the embedded silicon bridges are greatly expanded, plus the chip-first vs. chip-last (CoWoS) approach would make for greater challenges. I wouldn't be surprised if there was some playing with the pads/contact layers to deal with new-ish thermal-mechanical issues.

peterdb · Sep 9, 2024

I'd like to add my perspective to the "mask problem". Before retiring two years ago I spent 45 years in the photomask industry, working for one captive and one merchant mask maker, two mask lithography equipment manufacturers and an EDA company focusing on the manufacturing side of EDA. I've been product manager for OPC Services, Mask Process Compensation (MPC) software, mask and wafer defect management software and briefly for Litho Friendly Design (LFD) software.

Design rules include manufacturing rules which are meant to ensure that a design is manufacturable at the foundry. As lithography becomes more complex (think EUV, reflective masks, multi-patterning, phase shifting, custom illuminators, OPC, MPC, etc.) design rules on their own no longer accurately predict printability on wafer. Model-based verification approaches are used in conjunction with design rules to improve litho yield predictability (call it LFD for short) but are limited in accuracy because the models used for LFD are not the exact models used by the foundry for various reasons including IP protection and simulation costs.

Often the first exposures of critical masks for a new design reveal "hotspots" where disconnects between design verification and real lithography exist. A typical approach is to do die-to-die optical inspection to find hotspots which are revealed by their variability across the wafer, signifying areas that are sensitive to focus and exposure (process window) variability. Samples are inspected at higher resolution using electron beam imaging. Hotspots are fixed with the use of localized OPC methods that don't require redoing OPC on the entire layout, only addressing the regions of low yield without adversely affecting the rest of the layout.

For first designs of a new technology node it is possible that design rules might need to be refined for future designs to minimize litho yield impacts. The litho yield improvement process is iterative, with the possibility that more than one revised mask might be required to achieve acceptable litho yield, and note that for advanced technology there are many critical mask layers. This flow might be necessary on both chips that constitute Blackwell. Time to new mask is crucial since it directly impacts time to yield. Take note that TSMC is known to have mask shops at each fab, presumably to minimize this iteration time.

In the Nvidia Blackwell case, if the comment about a bad mask is accurate, the process I've laid out above is possibly what is going on. It is a normal process but less predictable for leading edge layouts at advanced nodes. My personal belief is that history has shown that TSMC are masters of this process and will predictably resolve the yield issue.
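The hotspot screening idea above (sites that vary a lot from die to die are likely process-window sensitive) can be sketched in a few lines. This is only an illustration under stated assumptions: the site names, critical-dimension (CD) values, and the 0.5 nm threshold are all hypothetical, and a real flow works from optical inspection images followed by e-beam review, not from a small table of numbers.

```python
from statistics import mean, pstdev


def find_hotspots(cd_by_site, max_sigma_nm):
    """Flag layout sites whose measurement varies too much across dies.

    cd_by_site maps a site label to the same feature measured on several dies.
    Returns (site, mean, sigma) tuples, most variable site first.
    """
    hotspots = []
    for site, cds in cd_by_site.items():
        sigma = pstdev(cds)
        if sigma > max_sigma_nm:
            hotspots.append((site, round(mean(cds), 2), round(sigma, 2)))
    return sorted(hotspots, key=lambda h: h[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical CD measurements in nm, one value per die for each site.
    sample = {
        "site_A": [20.1, 19.9, 20.0, 20.2, 19.8],
        "site_B": [20.0, 17.5, 21.8, 16.9, 22.4],
        "site_C": [20.0, 20.1, 19.9, 20.0, 20.0],
    }
    for site, avg, sigma in find_hotspots(sample, max_sigma_nm=0.5):
        print(f"{site}: mean {avg} nm, sigma {sigma} nm -> candidate for localized OPC")
```

Only the flagged sites would then go on to higher-resolution review and a localized fix, which matches the point that the rest of the layout is left alone.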

KevinK · Sep 9, 2024

peterdb said:
Often the first exposures of critical masks for a new design reveal "hotspots" where disconnects between design verification and real lithography exist.

Absolutely, both for new designs and new process / packaging nodes. You don’t discover most of the secondary causes of reliability and manufacturability issues until you peel back the primary ones.

peterdb said:
For first designs of a new technology node it is possible that design rules might need to be refined for future designs to minimize litho yield impacts.

Also true for new packaging technology. The DRC decks for the RDL layers, pad layers, interposer interconnect and chip-to-interposer connections might seem less complicated than for 3nm on-chip layers, and somewhat free from all the complications of litho enhancement, but those layers have their own complications: things like alignment between chip and interposer, thermal management, and the infamous differential thermal expansion challenges.

peterdb said:
My personal belief is that history has shown that TSMC are masters of this process and will predictably resolve the yield issue.

Agreed. The process is iterative for a new technology, no matter how well everything is modeled. TSMC is the master of pushing through to the other side.
