Google’s 400,000-Chip Monster Tensor Processing Unit Just Destroyed NVIDIA's Future!

Newbie question: Is the infrastructure cost of AI partly related to inefficient coding languages like Python or PyTorch? Is efficient coding (which we glimpsed but didn't really understand with DeepSeek) something they do in China but not the USA?
My take is that Python and PyTorch both rely on well-optimized, application-specific, hardware-tuned packages underneath, written in the native "dataflow assembly code" of each platform (the CUDA stack for NVIDIA, ROCm for AMD, etc.). DeepSeek identified some great areas for optimization: the model (multi-head latent attention), model partitioning (prefill/decode disaggregation), and system-level communication, and improved them mostly at the low level (not via the packages).
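To make that concrete, here is a minimal PyTorch-level sketch (my own illustration, not DeepSeek's or NVIDIA's code): the single Python call below is just a dispatcher, and on a GPU PyTorch routes it to a fused, hand-tuned attention kernel written in C++/CUDA.

```python
# Minimal sketch: the Python layer only describes the computation; on CUDA,
# scaled_dot_product_attention dispatches to a fused attention backend
# (FlashAttention-style or memory-efficient kernels) written in C++/CUDA.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(1, 8, 128, 64, device=device)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64, device=device)
v = torch.randn(1, 8, 128, 64, device=device)

# One Python-level call; the heavy lifting happens in the tuned backend kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```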


I don't know about other HW suppliers, but NVIDIA has created new system infrastructure inside Dynamo, plus worked with the three main model-serving engines that support PyTorch (SGLang, vLLM, and TensorRT-LLM) to implement these DeepSeek optimizations, and more, in the underlying packages and architecture for inference serving on NVIDIA hardware.
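As a toy picture of what prefill/decode disaggregation means (the names below are mine, not the actual Dynamo, vLLM, SGLang, or TensorRT-LLM APIs): prefill is compute-bound and runs once per request, while decode is latency- and bandwidth-bound and runs once per generated token, so the serving stack splits them onto different worker pools and hands the KV cache off between them.

```python
# Toy illustration of prefill/decode disaggregation. All names here are
# hypothetical; real serving stacks (Dynamo, vLLM, SGLang, TensorRT-LLM) have
# their own APIs. The point is only the split: compute-bound prefill and
# latency-bound decode run on separately sized and tuned worker pools.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Stand-in for the per-layer key/value tensors produced during prefill.
    tokens: list = field(default_factory=list)

def prefill_worker(prompt_tokens: list) -> KVCache:
    # Runs once per request: large matrix multiplies, high arithmetic intensity.
    return KVCache(tokens=list(prompt_tokens))

def decode_worker(cache: KVCache, max_new_tokens: int) -> list:
    # Runs one small step per generated token: dominated by KV-cache reads and
    # memory bandwidth, so it benefits from a differently provisioned GPU pool.
    output = []
    for step in range(max_new_tokens):
        next_token = f"<tok{step}>"      # placeholder for a real sampling step
        cache.tokens.append(next_token)
        output.append(next_token)
    return output

cache = prefill_worker(["Hello", "world"])      # would run on the prefill pool
print(decode_worker(cache, max_new_tokens=3))   # would run on the decode pool
```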

So Python and PyTorch need help to be fast and efficient with new models and hardware. DeepSeek mostly optimized at the low level, but NVIDIA and the associated developer communities followed up to make those optimizations accessible to all, mostly through open source.
 
Newbie question: Is the infrastructure cost of AI partly related to inefficient coding languages like Python or PyTorch? Is efficient coding (which we glimpsed but didn't really understand with DeepSeek) something they do in China but not the USA?
This is actually not a newbie question; it's a very good one.

First, the simple stuff. PyTorch and Python are related mainly because PyTorch exposes its programming interface through Python. The underlying layers of software PyTorch uses to interface with GPUs and TPUs are highly tuned code, from what I can tell usually C or C++. So the notion that using PyTorch implies an inefficient GPU/TPU interface doesn't appear to be accurate.
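A rough way to see this for yourself (my example; absolute numbers will vary wildly from machine to machine): the same matrix multiply written as a pure-Python loop versus a single PyTorch call that lands in a tuned BLAS or CUDA kernel.

```python
# Rough comparison: the same matrix multiply in pure Python vs. dispatched by
# PyTorch to a tuned backend. Timings are illustrative only.
import time
import torch

n = 128
a = torch.randn(n, n)
b = torch.randn(n, n)

t0 = time.perf_counter()
c_slow = [[sum(a[i, k].item() * b[k, j].item() for k in range(n))
           for j in range(n)] for i in range(n)]
t1 = time.perf_counter()

t2 = time.perf_counter()
c_fast = a @ b   # one Python call; the work runs in compiled C/C++ (or CUDA)
t3 = time.perf_counter()

print(f"pure Python loops: {t1 - t0:.3f} s")
print(f"torch matmul:      {t3 - t2:.6f} s")
```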

Regarding my negativity about Python as a programming language... I confess that I'm a high-performance software purist, and I don't like programming languages that are interpreted (meaning they are not compiled into processor-specific machine code) or that run on a "virtual machine" (Java is another example), where the language's instructions are translated and executed through run-time subroutines.

Python is also a member of the class of programming languages called "memory safe," in which dynamic memory allocation and deallocation is managed for you by the language runtime. These memory managers typically carve allocations out of a data structure called a "heap," with background "garbage collection" sweeping up dynamically deallocated memory and returning it to the heap. 🤮 Talk about inefficient! The cloud companies love their memory-safe languages. Rust (created at Mozilla and now championed heavily by Microsoft, among others) gets its safety from compile-time ownership checks rather than a garbage collector, and at least it has an "unsafe" mode for high-performance programming. Google invented a language called Go, which is garbage-collected and generally considered less strict than Rust, but Go also has its own "unsafe" escape hatch.
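For what it's worth, you can poke at Python's memory machinery from inside Python itself; a minimal sketch of what I mean (purely illustrative):

```python
# Every CPython object carries a reference count, and a cyclic garbage
# collector runs periodically to reclaim reference cycles that the counts
# alone can never free.
import gc
import sys

data = [0] * 1000
print(sys.getrefcount(data))   # refcount (includes the temporary argument reference)

# Build a reference cycle, drop our handles, and force a collection pass.
a = {}
b = {"other": a}
a["other"] = b
del a, b
print(gc.collect())            # number of unreachable objects found this pass
```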

DeepSeek did a lot of optimizations, not all of which are open for examination, but they did include controlling parallelism in PTX for Nvidia, and minimizing data copies and communication inefficiencies, a classic concern in network drivers. (In the networking world the ultimate in reducing data copies has been available for decades: Remote Direct Memory Access, or RDMA, a capability Nvidia uses widely in both Ethernet and InfiniBand.)
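Not DeepSeek's actual code, but the same theme shows up even at the PyTorch level: pinned (page-locked) host memory lets the GPU's DMA engine pull data directly, and asynchronous copies overlap transfer with compute, which is loosely the single-node cousin of what RDMA does across nodes.

```python
# Sketch of the "minimize copies, overlap communication with compute" theme
# at the PyTorch level (illustrative only, not DeepSeek's optimizations).
import torch

if torch.cuda.is_available():
    # Page-locked host memory: the GPU can DMA from it without a staging copy.
    host_buf = torch.randn(4096, 4096, pin_memory=True)
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        dev_buf = host_buf.to("cuda", non_blocking=True)  # asynchronous H2D copy
        result = dev_buf @ dev_buf                        # compute queued on the same stream
    torch.cuda.synchronize()
    print(result.shape)
else:
    print("CUDA not available; the pinned-memory path needs a GPU.")
```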

The reason most teams outside of China do not go as deep as DeepSeek did is that they focus on the models, training, and inference, and depend on higher-performance hardware to get the response times they need. The US restricted China from buying the best AI processors, so they had to get more creative with classic software-tuning efficiency tricks. Not to slight the DeepSeek accomplishments in any way, but I think that's all there is to it. As the saying goes, necessity is the mother of invention.
 
which we glimpsed but didn't really understand with DeepSeek
If you are really interested in the details, here are some links on how SGLang, vLLM, and TensorRT-LLM implement some of the DeepSeek optimizations for use by all.



 
Wow. Quite a coincidence. Today I saw this article: the IBM System/3090 with Vector Facilities pitted against a Cray X-MP.


" In the 48 years since the Cray 1A vector supercomputer was launched, essentially creating the supercomputing market distinct from mainframe and minicomputing systems that could do math on their CPUs or boost it through auxiliary coprocessors, it has been very difficult to get the cost of systems. Yeah, plus ça change, plus c’est la même chose not only about the lack of precise pricing, but also in hybrid architectures for supercomputing, as witnessed in 1989 by the IBM System/3090 with Vector Facilities pitted against a Cray X-MP. "

It's nostalgic to see those Fortran code analyses (scalar, vector, and parallel) again several decades later. It's also exciting that good old vector processing, now half a century old, is getting a new life in AI models.

Newbie question: Is the infrastructure cost of AI partly related to inefficient coding languages like Python or PyTorch? Is efficient coding (which we glimpsed but didn't really understand with DeepSeek) something they do in China but not the USA?
Cheap, ever-faster hardware makes programmers lazy. A long, long time ago, a programmer needed to be familiar with assembly language and computer architecture to write efficient code. Today maybe 90% of "software developers" have never heard of assembly language. Circumstances have changed.

One paper published almost 40 years ago, "Apple II is faster than Cray Y-MP", answered your questions. But certainly, algorithms have their limits.
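In the same spirit (my example, not the paper's): the right algorithm or data structure on any laptop beats brute force on much faster hardware once the problem gets big enough.

```python
# The same membership test, brute force vs. the right data structure. The O(1)
# hash lookup beats the O(n) list scan by orders of magnitude at scale,
# regardless of how fast the underlying hardware is.
import time

n = 100_000
haystack_list = list(range(n))
haystack_set = set(haystack_list)
needles = range(n - 200, n)          # worst case for the scan: items near the end

t0 = time.perf_counter()
hits_scan = sum(1 for x in needles if x in haystack_list)  # O(n) per lookup
t1 = time.perf_counter()
hits_set = sum(1 for x in needles if x in haystack_set)    # O(1) per lookup
t2 = time.perf_counter()

print(hits_scan == hits_set)  # True
print(f"list scan: {t1 - t0:.3f} s, set lookup: {t2 - t1:.6f} s")
```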
 
The reason most teams outside of China do not go as deep as DeepSeek did is that they focus on the models, training, and inference, and depend on higher-performance hardware to get the response times they need. The US restricted China from buying the best AI processors, so they had to get more creative with classic software-tuning efficiency tricks. Not to slight the DeepSeek accomplishments in any way, but I think that's all there is to it. As the saying goes, necessity is the mother of invention.

While I also think that's all there is to it, one must not overlook that it takes much more in-depth knowledge of the whole system to perform effective optimizations. So, through these exercises, they are training a pool of highly skilled (i.e. high quality) engineers, all while they already have a significant edge in quantity.


As to the hardware gap, their chips seem to be close enough that they can compensate with quantity and electricity, at least that's what the party claims!! (Now, of course, Alibaba and ByteDance are lining up to get H200s!)
 
While I also think that's all there is to it, one must not overlook that it takes much more in-depth knowledge of the whole system to perform effective optimizations. So, through these exercises, they are training a pool of highly skilled (i.e. high quality) engineers, all while they already have a significant edge in quantity.
I completely agree. Software engineers who understand how the hardware is designed and works in-depth are the real big picture people.
As to the hardware gap, their chips seem to be close enough that they can compensate with quantity and electricity, at least that's what the party claims!! (Now, of course, Alibaba and ByteDance are lining up to get H200s!)
I didn't know Alibaba and ByteDance were already ordering H200s.
 
While I also think that's all there is to it, one must not overlook that it takes much more in-depth knowledge of the whole system to perform effective optimizations. So, through these exercises, they are training a pool of highly skilled (i.e. high quality) engineers, all while they already have a significant edge in quantity.
There's a big difference between optimizing a fixed data-center pod configuration for a single model at a single PLPC point (performance, latency, power, and cost), à la DeepSeek, versus building system-level infrastructure that delivers optimized results broadly: for different models, and even for variations within a single model such as context length, across different pod configurations and GPUs/TPUs, and at different PLPC points. Pretty sure that Google and NVIDIA are the best at that, based on published info.
 
There's a big difference between optimizing a fixed data-center pod configuration for a single model at a single PLPC point (performance, latency, power, and cost), à la DeepSeek, versus building system-level infrastructure that delivers optimized results broadly: for different models, and even for variations within a single model such as context length, across different pod configurations and GPUs/TPUs, and at different PLPC points. Pretty sure that Google and NVIDIA are the best at that, based on published info.
I think you and @bilau (and I) are talking about two different levels of optimization. We were talking about application-level optimization, which is what PTX is used for, for example. What you seem to be discussing is above that, and is about system configurations (correct me if I’ve misunderstood). I would agree that at the system level Google and Nvidia strike me as the best sources.
 
We were talking about application-level optimization, which is what PTX is used for, for example. What you seem to be discussing is above that, and is about system configurations (correct me if I’ve misunderstood).
I think the difference you are outlining is mostly true. My point was more that application-level optimization at the low level for serving one specific model isn't transferable to other models, or even to the same model at different PLPC points or with different context lengths.
 