
Nvidia introduces Blackwell (800mm2 reticle limit N4P dies)

Xebec

Well-known member
NVIDIA Blackwell powers a new era of computing, enabling organizations everywhere to build and run real-time generative AI on trillion-parameter large language models.
  • New Blackwell GPU, NVLink and Resilience Technologies Enable Trillion-Parameter-Scale AI Models
  • New Tensor Cores and TensorRT-LLM Compiler Reduce LLM Inference Operating Cost and Energy by up to 25x
  • New Accelerators Enable Breakthroughs in Data Processing, Engineering Simulation, Electronic Design Automation, Computer-Aided Drug Design and Quantum Computing
  • Widespread Adoption by Every Major Cloud Provider, Server Maker and Leading AI Company
SAN JOSE, Calif., March 18, 2024 (GLOBE NEWSWIRE) -- Powering a new era of computing, NVIDIA today announced that the NVIDIA Blackwell platform has arrived — enabling organizations everywhere to build and run real-time generative AI on trillion-parameter large language models at up to 25x less cost and energy consumption than its predecessor.

The Blackwell GPU architecture features six transformative technologies for accelerated computing, which will help unlock breakthroughs in data processing, engineering simulation, electronic design automation, computer-aided drug design, quantum computing and generative AI — all emerging industry opportunities for NVIDIA.

“For three decades we’ve pursued accelerated computing, with the goal of enabling transformative breakthroughs like deep learning and AI,” said Jensen Huang, founder and CEO of NVIDIA. “Generative AI is the defining technology of our time. Blackwell is the engine to power this new industrial revolution. Working with the most dynamic companies in the world, we will realize the promise of AI for every industry.”

Among the many organizations expected to adopt Blackwell are Amazon Web Services, Dell Technologies, Google, Meta, Microsoft, OpenAI, Oracle, Tesla and xAI.

Sundar Pichai, CEO of Alphabet and Google: “Scaling services like Search and Gmail to billions of users has taught us a lot about managing compute infrastructure. As we enter the AI platform shift, we continue to invest deeply in infrastructure for our own products and services, and for our Cloud customers. We are fortunate to have a longstanding partnership with NVIDIA, and look forward to bringing the breakthrough capabilities of the Blackwell GPU to our Cloud customers and teams across Google, including Google DeepMind, to accelerate future discoveries.”

Andy Jassy, president and CEO of Amazon: “Our deep collaboration with NVIDIA goes back more than 13 years, when we launched the world’s first GPU cloud instance on AWS. Today we offer the widest range of GPU solutions available anywhere in the cloud, supporting the world’s most technologically advanced accelerated workloads. It's why the new NVIDIA Blackwell GPU will run so well on AWS and the reason that NVIDIA chose AWS to co-develop Project Ceiba, combining NVIDIA’s next-generation Grace Blackwell Superchips with the AWS Nitro System's advanced virtualization and ultra-fast Elastic Fabric Adapter networking, for NVIDIA's own AI research and development. Through this joint effort between AWS and NVIDIA engineers, we're continuing to innovate together to make AWS the best place for anyone to run NVIDIA GPUs in the cloud.”

Michael Dell, founder and CEO of Dell Technologies: “Generative AI is critical to creating smarter, more reliable and efficient systems. Dell Technologies and NVIDIA are working together to shape the future of technology. With the launch of Blackwell, we will continue to deliver the next-generation of accelerated products and services to our customers, providing them with the tools they need to drive innovation across industries.”

Demis Hassabis, cofounder and CEO of Google DeepMind: “The transformative potential of AI is incredible, and it will help us solve some of the world’s most important scientific problems. Blackwell’s breakthrough technological capabilities will provide the critical compute needed to help the world’s brightest minds chart new scientific discoveries.”

Mark Zuckerberg, founder and CEO of Meta: “AI already powers everything from our large language models to our content recommendations, ads, and safety systems, and it's only going to get more important in the future. We're looking forward to using NVIDIA's Blackwell to help train our open-source Llama models and build the next generation of Meta AI and consumer products.”

Satya Nadella, executive chairman and CEO of Microsoft: “We are committed to offering our customers the most advanced infrastructure to power their AI workloads. By bringing the GB200 Grace Blackwell processor to our datacenters globally, we are building on our long-standing history of optimizing NVIDIA GPUs for our cloud, as we make the promise of AI real for organizations everywhere.”

Sam Altman, CEO of OpenAI: “Blackwell offers massive performance leaps, and will accelerate our ability to deliver leading-edge models. We’re excited to continue working with NVIDIA to enhance AI compute.”

Larry Ellison, chairman and CTO of Oracle: "Oracle’s close collaboration with NVIDIA will enable qualitative and quantitative breakthroughs in AI, machine learning and data analytics. In order for customers to uncover more actionable insights, an even more powerful engine like Blackwell is needed, which is purpose-built for accelerated computing and generative AI.”

Elon Musk, CEO of Tesla and xAI: “There is currently nothing better than NVIDIA hardware for AI.”
Named in honor of David Harold Blackwell — a mathematician who specialized in game theory and statistics, and the first Black scholar inducted into the National Academy of Sciences — the new architecture succeeds the NVIDIA Hopper™ architecture, launched two years ago.

Blackwell Innovations to Fuel Accelerated Computing and Generative AI
Blackwell’s six revolutionary technologies, which together enable AI training and real-time LLM inference for models scaling up to 10 trillion parameters, include:
  • World’s Most Powerful Chip — Packed with 208 billion transistors, Blackwell-architecture GPUs are manufactured using a custom-built 4NP TSMC process with two-reticle limit GPU dies connected by 10 TB/second chip-to-chip link into a single, unified GPU.
  • Second-Generation Transformer Engine — Fueled by new micro-tensor scaling support and NVIDIA’s advanced dynamic range management algorithms integrated into NVIDIA TensorRT™-LLM and NeMo Megatron frameworks, Blackwell will support double the compute and model sizes with new 4-bit floating point AI inference capabilities (see the block-scaling sketch after this list).
  • Fifth-Generation NVLink — To accelerate performance for multitrillion-parameter and mixture-of-experts AI models, the latest iteration of NVIDIA NVLink® delivers groundbreaking 1.8TB/s bidirectional throughput per GPU, ensuring seamless high-speed communication among up to 576 GPUs for the most complex LLMs.
  • RAS Engine — Blackwell-powered GPUs include a dedicated engine for reliability, availability and serviceability. Additionally, the Blackwell architecture adds capabilities at the chip level to utilize AI-based preventative maintenance to run diagnostics and forecast reliability issues. This maximizes system uptime and improves resiliency for massive-scale AI deployments to run uninterrupted for weeks or even months at a time and to reduce operating costs.
  • Secure AI — Advanced confidential computing capabilities protect AI models and customer data without compromising performance, with support for new native interface encryption protocols, which are critical for privacy-sensitive industries like healthcare and financial services.
  • Decompression Engine — A dedicated decompression engine supports the latest formats, accelerating database queries to deliver the highest performance in data analytics and data science. In the coming years, data processing, on which companies spend tens of billions of dollars annually, will be increasingly GPU-accelerated.
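The press release does not spell out what "micro-tensor scaling" means, but the general idea behind block-scaled 4-bit formats is that small groups of values share a scale factor, so a 4-bit code can still cover a wide dynamic range. The NumPy sketch below is a minimal, hypothetical illustration of that idea only; the block size, the symmetric integer code, and every name in it are assumptions for illustration, not NVIDIA's actual FP4 or Transformer Engine implementation.

```python
# Minimal sketch of block-scaled 4-bit quantization (the "micro-tensor scaling"
# idea). Assumptions: a block size of 32 and a symmetric 4-bit integer code;
# NVIDIA's actual FP4 / Transformer Engine format is not public at this level
# of detail and will differ.
import numpy as np

def quantize_block_scaled(x, block=32, bits=4):
    """Quantize a 1-D tensor in blocks, each block sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for a symmetric 4-bit code
    pad = (-len(x)) % block
    xp = np.pad(x, (0, pad)).reshape(-1, block)   # [n_blocks, block]
    scales = np.abs(xp).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                     # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(xp / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, len(x)

def dequantize(q, scales, n):
    """Reconstruct approximate values from codes and per-block scales."""
    return (q * scales).reshape(-1)[:n]

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, s, n = quantize_block_scaled(w)
print("mean abs error:", float(np.abs(w - dequantize(q, s, n)).mean()))
```

Per-block scales keep quantization error bounded even when magnitudes vary widely across a tensor, which is the property that makes 4-bit inference plausible at all.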
A Massive Superchip
The NVIDIA GB200 Grace Blackwell Superchip connects two NVIDIA B200 Tensor Core GPUs to the NVIDIA Grace CPU over a 900GB/s ultra-low-power NVLink chip-to-chip interconnect.

For the highest AI performance, GB200-powered systems can be connected with the NVIDIA Quantum-X800 InfiniBand and Spectrum™-X800 Ethernet platforms, also announced today, which deliver advanced networking at speeds up to 800Gb/s.

The GB200 is a key component of the NVIDIA GB200 NVL72, a multi-node, liquid-cooled, rack-scale system for the most compute-intensive workloads. It combines 36 Grace Blackwell Superchips, which include 72 Blackwell GPUs and 36 Grace CPUs interconnected by fifth-generation NVLink. Additionally, GB200 NVL72 includes NVIDIA BlueField®-3 data processing units to enable cloud network acceleration, composable storage, zero-trust security and GPU compute elasticity in hyperscale AI clouds. The GB200 NVL72 provides up to a 30x performance increase compared to the same number of NVIDIA H100 Tensor Core GPUs for LLM inference workloads, and reduces cost and energy consumption by up to 25x.

The platform acts as a single GPU with 1.4 exaflops of AI performance and 30TB of fast memory, and is a building block for the newest DGX SuperPOD.
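As a rough sanity check on those rack-level numbers, the aggregates fall out of the per-device building blocks. The 480 GB of LPDDR per Grace CPU is mentioned later in this thread; the ~192 GB of HBM per Blackwell GPU is my own assumption, so treat this as an approximation rather than official specs.

```python
# Rough sanity check of the GB200 NVL72 aggregate figures quoted above.
# Per-device capacities are assumptions (~192 GB HBM per Blackwell GPU,
# 480 GB LPDDR per Grace CPU) and may differ from final product specs.
N_GPUS, N_CPUS = 72, 36

hbm_per_gpu_gb = 192        # assumed HBM3e per B200
lpddr_per_cpu_gb = 480      # Grace LPDDR, mentioned later in the thread

fast_memory_tb = (N_GPUS * hbm_per_gpu_gb + N_CPUS * lpddr_per_cpu_gb) / 1024
per_gpu_pflops = 1.4 * 1000 / N_GPUS   # back out per-GPU FP4 throughput from 1.4 EF

print(f"fast memory ~= {fast_memory_tb:.1f} TB")       # ~30 TB, matching the claim
print(f"per-GPU FP4 ~= {per_gpu_pflops:.0f} PFLOPS")   # just under 20 PFLOPS per GPU
```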

NVIDIA offers the HGX B200, a server board that links eight B200 GPUs through NVLink to support x86-based generative AI platforms. HGX B200 supports networking speeds up to 400Gb/s through the NVIDIA Quantum-2 InfiniBand and Spectrum-X Ethernet networking platforms.

Global Network of Blackwell Partners
Blackwell-based products will be available from partners starting later this year.

AWS, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure will be among the first cloud service providers to offer Blackwell-powered instances, as will NVIDIA Cloud Partner program companies Applied Digital, CoreWeave, Crusoe, IBM Cloud and Lambda. Sovereign AI clouds will also provide Blackwell-based cloud services and infrastructure, including Indosat Ooredoo Hutchinson, Nebius, Nexgen Cloud, Oracle EU Sovereign Cloud, the Oracle US, UK, and Australian Government Clouds, Scaleway, Singtel, Northern Data Group's Taiga Cloud, Yotta Data Services’ Shakti Cloud and YTL Power International.

GB200 will also be available on NVIDIA DGX™ Cloud, an AI platform co-engineered with leading cloud service providers that gives enterprise developers dedicated access to the infrastructure and software needed to build and deploy advanced generative AI models. AWS, Google Cloud and Oracle Cloud Infrastructure plan to host new NVIDIA Grace Blackwell-based instances later this year.

Cisco, Dell, Hewlett Packard Enterprise, Lenovo and Supermicro are expected to deliver a wide range of servers based on Blackwell products, as are Aivres, ASRock Rack, ASUS, Eviden, Foxconn, GIGABYTE, Inventec, Pegatron, QCT, Wistron, Wiwynn and ZT Systems.

Additionally, a growing network of software makers, including Ansys, Cadence and Synopsys — global leaders in engineering simulation — will use Blackwell-based processors to accelerate their software for designing and simulating electrical, mechanical and manufacturing systems and parts. Their customers can use generative AI and accelerated computing to bring products to market faster, at lower cost and with higher energy efficiency.

NVIDIA Software Support
The Blackwell product portfolio is supported by NVIDIA AI Enterprise, the end-to-end operating system for production-grade AI. NVIDIA AI Enterprise includes NVIDIA NIM™ inference microservices — also announced today — as well as AI frameworks, libraries and tools that enterprises can deploy on NVIDIA-accelerated clouds, data centers and workstations.

To learn more about the NVIDIA Blackwell platform, watch the GTC keynote and register to attend sessions from NVIDIA and industry leaders at GTC, which runs through March 21.
 
Hitting the reticle limit is more than a little concerning. I'd hope nvidia is deep into development of disaggregated dies.
All of their DC products have been at or near the ret limit for the better part of a decade. To me this is hardly concerning. Also the chip is disaggregated. Blackwell is two ret limited dies bridged together (assuming I understand their presentation properly).

Something with many homogeneous dies, like AMD does, burns a large percentage of your total silicon area on die-to-die PHYs. Even with how mature the defectivity on N4P is at this point, these dies are so big that it is almost certainly cheaper to split from 2 dies to maybe 4 or 8 dies. Beyond that, I'm going to guess that going to smaller and smaller dies eats up too much area, to the point that the better yield no longer offsets the added cost.

If I try to put myself in a GPU designer's shoes, I think even if it costs more doing only 2 dies sharing the shorter side, the other benefits outweigh said cost. More specifically, because you have the whole top, bottom, and outer edge shorelines free, you can slap down more memory controllers and put more HBM on the package. Considering how memory bound these devices are, that extra bandwidth and capacity is likely a LARGE value add. Combine this with the prices and margins NVIDIA can command for these products, and I'd bet that the cost to manufacture is almost assuredly taking a backseat to the extra performance they can extract.
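For readers following the yield-versus-overhead argument, a toy model makes the shape of the trade visible: assume a Poisson yield model and a flat die-to-die PHY area tax on each die, then compare the silicon cost of building one GPU from different numbers of chiplets. Every constant below (defect density, PHY area, wafer cost) is an invented placeholder chosen only to illustrate the curve, not real N4P or NVIDIA data.

```python
# Toy model of the trade-off discussed above: splitting a ~1600 mm^2 GPU into
# more chiplets improves yield but burns extra area on die-to-die PHYs.
# Every constant here is a made-up placeholder, NOT real N4P or NVIDIA data.
import math

TOTAL_LOGIC_MM2 = 1600      # useful logic, roughly two reticle-limited dies
D0_PER_CM2 = 0.07           # assumed defect density for a mature node
PHY_MM2_PER_DIE = 40        # assumed die-to-die PHY + beachfront tax per die
WAFER_COST = 17000          # assumed wafer price, $
WAFER_AREA_MM2 = math.pi * 150**2 * 0.9   # usable area of a 300 mm wafer

def silicon_cost(n_dies):
    """Approximate cost of n_dies known-good dies making up one GPU."""
    die_mm2 = TOTAL_LOGIC_MM2 / n_dies + PHY_MM2_PER_DIE
    yield_ = math.exp(-D0_PER_CM2 * die_mm2 / 100)   # Poisson yield model
    dies_per_wafer = WAFER_AREA_MM2 / die_mm2        # ignores scribe/edge loss
    return n_dies * WAFER_COST / (dies_per_wafer * yield_), yield_

for n in (2, 4, 8, 16):
    cost, y = silicon_cost(n)
    print(f"{n:>2} dies: yield ~{y:.0%}, silicon ~${cost:,.0f}")
# With these placeholder numbers the cost bottoms out around 4-8 dies and then
# creeps back up as the PHY tax dominates -- the shape of the argument above.
```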
 
Last edited:
All of their DC products have been at or near the ret limit for the better part of a decade. To me this is hardly concerning. Also the chip is disaggregated. Blackwell is two ret limited dies bridged together (assuming I understand their presentation properly).

Something with many homogeneous dies, like AMD does, burns a large percentage of your total silicon area on die-to-die PHYs. Even with how mature the defectivity on N4P is at this point, these dies are so big that it is almost certainly cheaper to split it to maybe 4 or 8 dies. Beyond that, I'm going to guess that going to smaller and smaller dies eats up too much area, to the point that the better yield no longer offsets the added cost.

If I try to put myself in a GPU designer's shoes, I think even if it costs more doing only 2 dies sharing the shorter side, the other benefits outweigh said cost. More specifically, because you have the whole top, bottom, and outer edge shorelines free, you can slap down more memory controllers and put more HBM on the package. Considering how memory bound these devices are, that extra bandwidth and capacity is likely a LARGE value add. Considering the prices and margins NVIDIA can command for these products, the cost to manufacture almost assuredly takes a backseat to the extra performance they can extract.
Great points. I guess I was thinking about things in the traditionally price-sensitive context. With such dominance nvidia can charge whatever they want for these.
 
Great points. I guess I was thinking about things in the traditionally price sensitive context. With such dominance nvidia can charge whatever they want for these
The NVIDIA dominance thing is part of it. But this logic applies to all datacenter xPUs/ASICs, methinks. For them the main costs are powering and building the data center that said chips will get slotted into. If you can increase computational efficiency, then their total cost of ownership decreases. My understanding is that if AMD made Genoa exactly as it is but at half the energy, they could charge over double the price and still sell like hotcakes, because the CPU is a one-time expense while the power is something you are stuck with for many years. The other way they can provide value is increasing the performance per rack. More performance means each datacenter can then generate more revenue. It's very different from a consumer product like a DIY CPU. For products like that I just want the cheapest thing that meets my needs, since my TCO is mostly the cost of the product itself and any extra performance will not be valued as heavily past some threshold.
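A back-of-the-envelope sketch of that TCO argument, with entirely made-up numbers: the point is only that the power-related costs of datacenter silicon (facility build-out plus electricity over the service life) can be on the same order as the purchase price, which is why performance per watt supports large price premiums.

```python
# Back-of-the-envelope illustration of the TCO argument above: for datacenter
# silicon, power-related costs (facility build-out plus electricity) can be
# comparable to the chip's purchase price. Every constant is a hypothetical
# placeholder, not vendor data.
CHIP_PRICE = 10_000        # $, one-time purchase (hypothetical server CPU)
CHIP_POWER_KW = 0.40       # average draw of the chip (assumed)
PUE = 1.4                  # facility overhead multiplier (assumed)
YEARS = 5                  # service life (assumed)
ENERGY_COST_KWH = 0.10     # $/kWh (assumed)
BUILD_COST_PER_W = 10      # $ of datacenter construction per watt of load (assumed)

def power_related_cost(power_kw):
    facility_kw = power_kw * PUE
    build = facility_kw * 1000 * BUILD_COST_PER_W
    energy = facility_kw * 24 * 365 * YEARS * ENERGY_COST_KWH
    return build + energy

baseline = power_related_cost(CHIP_POWER_KW)
halved = power_related_cost(CHIP_POWER_KW / 2)
print(f"power-related cost, baseline chip:   ${baseline:,.0f}")   # comparable to CHIP_PRICE
print(f"power-related cost, half-power chip: ${halved:,.0f}")
print(f"TCO headroom for a price premium:    ${baseline - halved:,.0f}")
```

With GPUs drawing several times more power than a server CPU, the same structure tilts even further toward paying for efficiency.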
 
Last edited:
Great points. I guess I was thinking about things in the traditionally price sensitive context. With such dominance nvidia can charge whatever they want for these
Price sensitive.... LOL. Jensen said the first board costs 10B .... but the second board only costs 5B....
He was only exaggerating a little :LOL: :ROFLMAO: :LOL: :ROFLMAO:
Someday AI will need to be cost effective. Today is not that day
 
Something with many homogeneous dies, like AMD does, burns a large percentage of your total silicon area on die-to-die PHYs.
Not very big. This paper from Nvidia probably describes the pre-production version of their chip to chip fabric Phy: https://ieeexplore.ieee.org/document/10011563
I'm going to guess that going to smaller and smaller dies eats up too much area to the point the better yield no longer offsets the added cost.
I think your first remark - Nvidia have already mastered large dies - combined with the maturity of N4 explains why they stick with what works.
you have the whole top, bottom, and outer edge shorelines free you can slap down more memory controllers and put more HBM on the package. Considering how memory bound these devices are, that extra bandwidth and capacity is likely a LARGE value add.
Actually they chose to zip the chips together on a long edge. That probably makes the packaging easier, with a 6:4 ratio for the core chips instead of 8:3. Adding the HBMs at top and bottom looks almost square. Interesting that they actually reduced the ratio of HBMs per full-reticle GPU (just 4 now), though they are faster and will have more capacity. Using the long edge for the fabric zipper gives them more effective unification of the two halves; less queuing means lower latency and a better illusion of a unitary device.

They also built a memory extension tier in with 480GB of LPDDR on the Grace chip. All of which will likely prove fascinating to folks trying to optimize their algorithms.
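A quick geometry check of the long-edge versus short-edge zip observation, assuming each Blackwell die is roughly 26 mm x 33 mm (near the ~858 mm2 reticle field); the real floorplan dimensions are not public, so the exact ratios are only illustrative.

```python
# Quick geometry check of the "zip on the long edge" observation above.
# Assumes each die is roughly 26 mm x 33 mm (near the ~858 mm^2 reticle field);
# the real floorplan dimensions are not public.
SHORT, LONG = 26.0, 33.0    # mm, assumed die edges

configs = {
    "long-edge zip":  (LONG, 2 * SHORT),   # dies stacked across the short side
    "short-edge zip": (2 * LONG, SHORT),   # dies placed end to end
}
for name, (w, h) in configs.items():
    aspect = max(w, h) / min(w, h)
    print(f"{name}: {w:.0f} x {h:.0f} mm, aspect ratio ~{aspect:.2f}:1")
# Long-edge zip comes out ~1.58:1 (close to the 6:4 mentioned above), versus
# ~2.54:1 (~8:3) for a short-edge zip, so with HBM rows above and below the
# package ends up close to square.
```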
 
Not very big. This paper from Nvidia probably describes the pre-production version of their chip to chip fabric Phy: https://ieeexplore.ieee.org/document/10011563

I think your first remark - Nvidia have already mastered large dies - combined with the maturity of N4 explains why they stick with what works.
Percentage was the wrong word. A better way to describe it: if, say, 5-10% of each die is your die-to-die communications, then when you multiply that by say 12 dies (and even more than that if you want to count AMD's big server IO dies), you effectively have a full die's worth of these connections. Anyway, I am interested to get your thoughts on the matter. Is this thought process wrong? Or is it fine, but in all likelihood NVIDIA is just going with what they know and not overdoing it with added complexity, with my logic being a secondary/tertiary benefit?

Actually they chose to zip the chips together on a long edge. That probably makes the packaging easier, with a 6:4 ratio for the core chips instead of 8:3. Adding the HBMs at top and bottom looks almost square. Interesting that they actually reduced the ratio of HBMs per full-reticle GPU (just 4 now), though they are faster and will have more capacity. Using the long edge for the fabric zipper gives them more effective unification of the two halves; less queuing means lower latency and a better illusion of a unitary device.
Good catch
They also built a memory extension tier in with 480GB of LPDDR on the Grace chip. All of which will likely prove fascinating to folks trying to optimize their algorithms.
It will never stop being funny to me that Grace is a CPU that basically acts as a memory controller/northbridge. I have to assume NVIDIA can do better than that for a future generation. Either slimming Grace down into a simpler dedicated ASIC, or getting some native AI acceleration on the chip so maybe it can break up tasks for the GPU, or do some of the more computationally complex parts of an algorithm while the GPU does the more geometric stuff? But maybe what I said makes no sense at all.
 
Last edited:
How so? This quote was all I found:
Elon Musk commented: “There is currently nothing better than Nvidia hardware for AI.”

Was that one of Elon's many deleted Tweets?

Jensen spoke at the Synopsys Users Group yesterday. I have never seen that guy happier than he is today. He told a funny story about how, when they first started Nvidia, he traded 250,000 shares of NVDA stock for access to the Synopsys tools, which was quite common back in the day. That stock today would be worth many billions of dollars. 😂 He also highlighted the importance of partners, saying he could not have succeeded at Nvidia without Synopsys. Jensen has also stated the same about their partnership with TSMC, both TSMC and Synopsys being #1 companies. There was a press release issued as an example of the joint partnership:

 
let's call it 5-10% of the die is your die-to-die communications, then when you multiply it by say 12 dies (and even more than that if you want to count AMD's big server IO dies),
There are two situations: the chiplet is at the end of a spoke, or the chiplet is in the middle of a fabric. At the end of a spoke the bandwidth is in proportion to the chiplet: in or out, the BW depends on the resources on that chip. Generally a smaller chip needs a smaller interconnect, the same proportion of area. An example of this is the CCX in AMD CPUs.

In the middle of the fabric, traffic may be just passing by. Breaking into many chiplets can waste increasing proportions on edge connections to allow the fabric to pass. So fabric chips tend to want to be big. You could argue that Blackwell's pair of dies is topologically end-of-spoke, but actually a lot of traffic is going to the HBMs or the external NVLinks, from both sides, so that zipper is a fabric. If you broke the GPU into 4 chips, it becomes clear that the fabric load is almost as big on each quarter as it was on each half, so you start to lose proportionally.
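The "fabric chips want to be big" point can be put in toy-model form: if traffic is roughly uniform, every cut through the GPU has to carry something close to the full bisection bandwidth, so the die-to-die PHY area per tile stays roughly flat while the logic per tile shrinks. The constants below are invented purely for illustration.

```python
# Toy model of the "end of spoke" vs "middle of the fabric" distinction above.
# Assumptions: uniform traffic, so each cut through the GPU must carry roughly
# the full bisection bandwidth, and D2D PHY area scales with the bandwidth
# crossing that die's edges. All constants are illustrative, not silicon data.
TOTAL_LOGIC_MM2 = 1600        # useful logic across the whole GPU (assumed)
BISECTION_TBPS = 10           # bandwidth any full cut must carry (assumed)
PHY_MM2_PER_TBPS = 2.0        # beachfront + PHY area per TB/s of D2D (assumed)

def phy_fraction(n_tiles):
    """Split the GPU into n tiles in a row; report PHY area as a share of each tile."""
    logic_per_tile = TOTAL_LOGIC_MM2 / n_tiles
    # Interior tiles have two cut edges, end tiles one; use the average.
    cuts_per_tile = 2 * (n_tiles - 1) / n_tiles
    phy_per_tile = cuts_per_tile * BISECTION_TBPS * PHY_MM2_PER_TBPS
    return phy_per_tile / (logic_per_tile + phy_per_tile)

for n in (2, 4, 8):
    print(f"{n} tiles: D2D PHY is ~{phy_fraction(n):.0%} of each tile")
# 2 tiles: ~2%; 4 tiles: ~7%; 8 tiles: ~15% -- the fabric tax grows as the
# tiles shrink, which is the "fabric chips want to be big" point.
```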

Another interesting variant is the AMD MI300X, where they have end-of-spoke chiplets sitting *on top* of the large base dies. The base dies function as active interposers, providing fabric, cache, memory interfaces, and IO, just like the IO dies in Rome/Milan/Genoa, but now switched from horizontal to vertical interconnect (and heat flow, what fun!). And you can see they keep the fabric on the big dies (although they do go for quarters, not halves).
It will never stop being funny to me that Grace is a CPU that basically acts as a memory controller/northbridge.
:)
I have to assume NVIDIA can do better than that for a future generation. Either slimming Grace down into a simpler dedicated ASIC, or getting some native AI acceleration on the chip so maybe it can break up tasks for the GPU, or do some of the more computationally complex parts of an algorithm while the GPU does the more geometric stuff? But maybe what I said makes no sense at all.
There seem to be several obvious upgrades ahead using this GPU. As you say, a better Grace. Hopper had a 1:1 ratio to Grace, while each Blackwell socket gets just half a Grace, and each reticle-sized die is down to a quarter. That seems like an obvious placeholder waiting for a bandwidth upgrade, especially as it seems to be the pivot in a coherent domain. Whether it needs more functionality is unclear, though we might expect something looking more like an NVSwitch with the memory interfaces and a few supervisory cores, using the all-reduce/broadcast functionality we see in NVSwitch. Grace dates from an era when Nvidia seemed interested in competing for datacenter servers, but I suspect it is less important to them now; perhaps they simply did not have the replacement ready on schedule for the first Blackwell. And maybe it eases the upgrade path from GH200, especially when we think about what must have looked desirable in planning two or three years ago.

The other obvious upgrade is to HBM4, which should multiply memory bandwidth by about 1.6x in the same package size, and which again helps answer the question of why they thought they only needed 4 HBMs per reticle.
 
BTW, what's going to happen if we start using High-NA EUV later? Can we still get an ~800mm2-ish single chip? Since AI wants BF-chips, it's a little concerning I guess...
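For context on that question: High-NA (0.55 NA) EUV scanners use anamorphic optics, which halve the exposure field in one direction relative to today's ~26 mm x 33 mm field. The arithmetic below covers field sizes only; whether a given design stitches fields or splits into smaller dies is a separate choice.

```python
# Field-size arithmetic behind the High-NA concern above.
current_field_mm2 = 26 * 33      # standard EUV/DUV exposure field, ~the reticle limit
high_na_field_mm2 = 26 * 16.5    # High-NA (0.55 NA) half field

print(f"current field: {current_field_mm2:.0f} mm^2")   # 858 mm^2
print(f"High-NA field: {high_na_field_mm2:.0f} mm^2")   # 429 mm^2
# An ~800 mm^2 monolithic die no longer fits in one High-NA field, so designs
# on those layers would need field stitching or a split into smaller dies.
```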
 