Google’s 400,000-Chip Monster Tensor Processing Unit Just Destroyed NVIDIA's Future!

Newbie question: Is the infrastructure cost of AI partly related to inefficient coding languages like Python or PyTorch? Is efficient coding (which we glimpsed but didn't really understand with DeepSeek) something they do in China but not the USA?
My take is that Python and PyTorch both rely on well-optimized, application-specific, hardware-tuned packages underneath, written in the native "dataflow assembly code" of each platform (the CUDA stack for NVIDIA, ROCm for AMD, etc.). DeepSeek identified some great areas for optimization: the model (multi-head latent attention), model partitioning (prefill/decode disaggregation), and system-level communication, and improved them mostly at the low level (not via the packages).
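To make that concrete, here is a minimal PyTorch-level sketch (my own illustration, not DeepSeek's or NVIDIA's code): the single Python call below is just a dispatcher, and on a GPU PyTorch routes it to a fused, hand-tuned attention kernel written in C++/CUDA.

```python
# Minimal sketch: the Python layer only describes the computation; on CUDA,
# scaled_dot_product_attention dispatches to a fused attention backend
# (FlashAttention-style or memory-efficient kernels) written in C++/CUDA.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(1, 8, 128, 64, device=device)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64, device=device)
v = torch.randn(1, 8, 128, 64, device=device)

# One Python-level call; the heavy lifting happens in the tuned backend kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```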


I don't know about other HW suppliers, but NVIDIA has created new system infrastructure inside Dynamo, plus worked with the three main model-serving engines that support PyTorch (SGLang, vLLM, and TensorRT-LLM) to implement these DeepSeek optimizations, and more, in the underlying packages and architecture for inference serving on NVIDIA hardware.
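As a toy picture of what prefill/decode disaggregation means (the names below are mine, not the actual Dynamo, vLLM, SGLang, or TensorRT-LLM APIs): prefill is compute-bound and runs once per request, while decode is latency- and bandwidth-bound and runs once per generated token, so the serving stack splits them onto different worker pools and hands the KV cache off between them.

```python
# Toy illustration of prefill/decode disaggregation. All names here are
# hypothetical; real serving stacks (Dynamo, vLLM, SGLang, TensorRT-LLM) have
# their own APIs. The point is only the split: compute-bound prefill and
# latency-bound decode run on separately sized and tuned worker pools.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Stand-in for the per-layer key/value tensors produced during prefill.
    tokens: list = field(default_factory=list)

def prefill_worker(prompt_tokens: list) -> KVCache:
    # Runs once per request: large matrix multiplies, high arithmetic intensity.
    return KVCache(tokens=list(prompt_tokens))

def decode_worker(cache: KVCache, max_new_tokens: int) -> list:
    # Runs one small step per generated token: dominated by KV-cache reads and
    # memory bandwidth, so it benefits from a differently provisioned GPU pool.
    output = []
    for step in range(max_new_tokens):
        next_token = f"<tok{step}>"      # placeholder for a real sampling step
        cache.tokens.append(next_token)
        output.append(next_token)
    return output

cache = prefill_worker(["Hello", "world"])      # would run on the prefill pool
print(decode_worker(cache, max_new_tokens=3))   # would run on the decode pool
```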

So Python and PyTorch need help to be fast and efficient with new models and hardware. DeepSeek mostly optimized at the low level, but NVIDIA and the associated developer communities followed up to make those optimizations accessible to all, mostly through open source.
 
Newbie question: Is the infrastructure cost of AI partly related to inefficient coding languages like Python or PyTorch? Is efficient coding (which we glimpsed but didn't really understand with DeepSeek) something they do in China but not the USA?
This is actually not a newbie question; it's a very good one.

First, the simple stuff. PyTorch and Python are related mainly because PyTorch exposes its programming interface through Python. The underlying layers of software PyTorch uses to interface with GPUs and TPUs are highly tuned code, from what I can tell usually C or C++. So the notion that using PyTorch implies an inefficient GPU/TPU interface doesn't appear to be accurate.
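A rough way to see this for yourself (my example; absolute numbers will vary wildly from machine to machine): the same matrix multiply written as a pure-Python loop versus a single PyTorch call that lands in a tuned BLAS or CUDA kernel.

```python
# Rough comparison: the same matrix multiply in pure Python vs. dispatched by
# PyTorch to a tuned backend. Timings are illustrative only.
import time
import torch

n = 128
a = torch.randn(n, n)
b = torch.randn(n, n)

t0 = time.perf_counter()
c_slow = [[sum(a[i, k].item() * b[k, j].item() for k in range(n))
           for j in range(n)] for i in range(n)]
t1 = time.perf_counter()

t2 = time.perf_counter()
c_fast = a @ b   # one Python call; the work runs in compiled C/C++ (or CUDA)
t3 = time.perf_counter()

print(f"pure Python loops: {t1 - t0:.3f} s")
print(f"torch matmul:      {t3 - t2:.6f} s")
```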

Regarding my negativity about Python as a programming language... I confess that I'm a high-performance software purist, and I don't like programming languages that are interpreted (meaning they are not compiled into processor-specific machine code) or that run on a "virtual machine" (Java is another example), where the language's instructions are translated and executed through run-time subroutines.

Python is also a member of the class of programming languages called "memory safe," in which dynamic memory allocation and deallocation is managed for you by the language runtime. These memory managers typically carve allocations out of a data structure called a "heap," with background "garbage collection" sweeping up dynamically deallocated memory and returning it to the heap. 🤮 Talk about inefficient! The cloud companies love their memory-safe languages. Rust (created at Mozilla and now championed heavily by Microsoft, among others) gets its safety from compile-time ownership checks rather than a garbage collector, and at least it has an "unsafe" mode for high-performance programming. Google invented a language called Go, which is garbage-collected and generally considered less strict than Rust, but Go also has its own "unsafe" escape hatch.
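For what it's worth, you can poke at Python's memory machinery from inside Python itself; a minimal sketch of what I mean (purely illustrative):

```python
# Every CPython object carries a reference count, and a cyclic garbage
# collector runs periodically to reclaim reference cycles that the counts
# alone can never free.
import gc
import sys

data = [0] * 1000
print(sys.getrefcount(data))   # refcount (includes the temporary argument reference)

# Build a reference cycle, drop our handles, and force a collection pass.
a = {}
b = {"other": a}
a["other"] = b
del a, b
print(gc.collect())            # number of unreachable objects found this pass
```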

DeepSeek did a lot of optimizations, not all of which are open for examination, but they did include controlling parallelism in PTX for Nvidia, and minimizing data copies and communication inefficiencies, a classic concern in network drivers. (In the networking world the ultimate in reducing data copies has been available for decades: Remote Direct Memory Access, or RDMA, a capability Nvidia uses widely in both Ethernet and InfiniBand.)
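Not DeepSeek's actual code, but the same theme shows up even at the PyTorch level: pinned (page-locked) host memory lets the GPU's DMA engine pull data directly, and asynchronous copies overlap transfer with compute, which is loosely the single-node cousin of what RDMA does across nodes.

```python
# Sketch of the "minimize copies, overlap communication with compute" theme
# at the PyTorch level (illustrative only, not DeepSeek's optimizations).
import torch

if torch.cuda.is_available():
    # Page-locked host memory: the GPU can DMA from it without a staging copy.
    host_buf = torch.randn(4096, 4096, pin_memory=True)
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        dev_buf = host_buf.to("cuda", non_blocking=True)  # asynchronous H2D copy
        result = dev_buf @ dev_buf                        # compute queued on the same stream
    torch.cuda.synchronize()
    print(result.shape)
else:
    print("CUDA not available; the pinned-memory path needs a GPU.")
```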

The reason most teams outside of China do not go as deep as DeepSeek did is that they focus on the models, training, and inference, and depend on higher-performance hardware to get the response times they need. The US restricted China from buying the best AI processors, so they had to get more creative with classic software-tuning efficiency tricks. Not to slight the DeepSeek accomplishments in any way, but I think that's all there is to it. As the saying goes, necessity is the mother of invention.
 
which we glimpsed but didn't really understand with DeepSeek
If you are really interested in the details, here are some links on how SGLang, vLLM, and TensorRT-LLM implement some of the DeepSeek optimizations for use by all.



 
Wow. Quite a coincidence. Today I saw this article: the IBM System/3090 with Vector Facilities pitted against a Cray X-MP.


" In the 48 years since the Cray 1A vector supercomputer was launched, essentially creating the supercomputing market distinct from mainframe and minicomputing systems that could do math on their CPUs or boost it through auxiliary coprocessors, it has been very difficult to get the cost of systems. Yeah, plus ça change, plus c’est la même chose not only about the lack of precise pricing, but also in hybrid architectures for supercomputing, as witnessed in 1989 by the IBM System/3090 with Vector Facilities pitted against a Cray X-MP. "

It's nostalgic to see those Fortran code analyses (scalar, vector, and parallel) again several decades later. It's also exciting that good old vector processing, now half a century old, is getting a new life in AI models.

Newbie question: Is the infrastructure cost of AI partly related to inefficient coding languages like Python or PyTorch? Is efficient coding (which we glimpsed but didn't really understand with DeepSeek) something they do in China but not the USA?
Cheap, ever-faster hardware makes programmers lazy. A long, long time ago, a programmer needed to be familiar with assembly language and computer architecture to write efficient code. Today maybe 90% of "software developers" have never heard of assembly language. Circumstances have changed.

One paper published almost 40 years ago, "Apple II is faster than Cray Y-MP", answered your questions. But certainly, algorithms have their limits.
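In the same spirit (my example, not the paper's): the right algorithm or data structure on any laptop beats brute force on much faster hardware once the problem gets big enough.

```python
# The same membership test, brute force vs. the right data structure. The O(1)
# hash lookup beats the O(n) list scan by orders of magnitude at scale,
# regardless of how fast the underlying hardware is.
import time

n = 100_000
haystack_list = list(range(n))
haystack_set = set(haystack_list)
needles = range(n - 200, n)          # worst case for the scan: items near the end

t0 = time.perf_counter()
hits_scan = sum(1 for x in needles if x in haystack_list)  # O(n) per lookup
t1 = time.perf_counter()
hits_set = sum(1 for x in needles if x in haystack_set)    # O(1) per lookup
t2 = time.perf_counter()

print(hits_scan == hits_set)  # True
print(f"list scan: {t1 - t0:.3f} s, set lookup: {t2 - t1:.6f} s")
```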
 
The reason most teams outside of China do not go as deep as DeepSeek did is that they focus on the models, training, and inference, and depend on higher-performance hardware to get the response times they need. The US restricted China from buying the best AI processors, so they had to get more creative with classic software-tuning efficiency tricks. Not to slight the DeepSeek accomplishments in any way, but I think that's all there is to it. As the saying goes, necessity is the mother of invention.

While I also think that's all there is to it, one must not overlook that it takes much more in-depth knowledge of the whole system to perform effective optimizations. So, through these exercises, they are training a pool of highly skilled (i.e. high quality) engineers, all while they already have a significant edge in quantity.


As to the hardware gap, their chips seem to be close enough that they can compensate with quantity and electricity, at least that's what the party claims!! (Now, of course, Alibaba and ByteDance are lining up to get H200s!)
 
While I also think that's all there is to it, one must not overlook that it takes much more in-depth knowledge of the whole system to perform effective optimizations. So, through these exercises, they are training a pool of highly skilled (i.e. high quality) engineers, all while they already have a significant edge in quantity.
I completely agree. Software engineers who understand how the hardware is designed and works in-depth are the real big picture people.
As to the hardware gap, their chips seem to be close enough that they can compensate with quantity and electricity, at least that's what the party claims!! (Now, of course, Alibaba and ByteDance are lining up to get H200s!)
I didn't know Alibaba and ByteDance were already ordering H200s.
 
While I also think that's all there is to it, one must not overlook that it takes much more in-depth knowledge of the whole system to perform effective optimizations. So, through these exercises, they are training a pool of highly skilled (i.e. high quality) engineers, all while they already have a significant edge in quantity.
There's a big difference between optimizing a fixed data-center pod configuration for a single model at a single PLPC point (performance, latency, power, and cost), à la DeepSeek, versus building system-level infrastructure that delivers optimized results broadly: for different models, and even for variations within a single model such as context length, across different pod configurations and GPUs/TPUs, and at different PLPC points. Pretty sure that Google and NVIDIA are the best at that, based on published info.
 
There's a big difference between optimizing a fixed data-center pod configuration for a single model at a single PLPC point (performance, latency, power, and cost), à la DeepSeek, versus building system-level infrastructure that delivers optimized results broadly: for different models, and even for variations within a single model such as context length, across different pod configurations and GPUs/TPUs, and at different PLPC points. Pretty sure that Google and NVIDIA are the best at that, based on published info.
I think you and @bilau (and I) are talking about two different levels of optimization. We were talking about application-level optimization, which is what PTX is used for, for example. What you seem to be discussing is above that, and is about system configurations (correct me if I’ve misunderstood). I would agree that at the system level Google and Nvidia strike me as the best sources.
 
We were talking about application-level optimization, which is what PTX is used for, for example. What you seem to be discussing is above that, and is about system configurations (correct me if I’ve misunderstood).
I think the difference you are outlining is mostly true. My point was more that application-level optimization at the low level for serving one specific model isn't transferable to other models, or even to the same model at different PLPC points or with different context lengths.
 