Google’s 400,000-Chip Monster Tensor Processing Unit Just Destroyed NVIDIA's Future!

Newbie question: Is the infrastructure cost of AI partly related to inefficient coding languages like Python or PyTorch? Is efficient coding (which we glimpsed but didn't really understand with DeepSeek) something they do in China but not the USA?
My take is that Python and PyTorch both rely on well-optimized, application-specific packages that are tuned in the underlying native "dataflow assembly code" (CUDA for NVIDIA, ROCm for AMD, etc.). DeepSeek identified some great areas for optimization: the model itself (multi-head latent attention), model partitioning (prefill/decode disaggregation), and system-level communication, and improved them mostly at the low level (not via packages).
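
To make that concrete, here is a minimal sketch of my own (not DeepSeek's code, and the tensor shapes are made up) showing how a single attention call in PyTorch hands the real work to a fused native kernel:

import torch
import torch.nn.functional as F

# Made-up shapes: batch=1, heads=16, sequence=2048, head_dim=128, on a CUDA GPU.
q = torch.randn(1, 16, 2048, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 16, 2048, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 16, 2048, 128, device="cuda", dtype=torch.float16)

# One Python call; PyTorch dispatches it to a fused FlashAttention-style CUDA kernel.
# The Python layer only describes the work; the tuned native code does it.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

Model-level ideas like multi-head latent attention change what gets computed; the kernels underneath still come from tuned CUDA/ROCm libraries.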


I don't know about other HW suppliers, but NVIDIA has created new system infrastructure inside Dynamo, plus worked on the three main model serving engines that support PyTorch (SGLang, vLLM, and TensorRT-LLM) to implement these DeepSeek optimizations, and more, in the underlying packages and architecture for inference serving on NVIDIA hardware.

So Python and PyTorch need help to be fast and efficient with new models and hardware. DeepSeek mostly optimized at a low level, but NVIDIA and associated developers followed up to make the optimizations accessible to all, mostly through open source.
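
As a rough sketch of what "accessible to all" looks like in practice, here is vLLM's offline Python API; the checkpoint name is just an example and assumes that model is available to the serving host:

from vllm import LLM, SamplingParams

# The engine, not this script, picks the optimized attention/KV-cache kernels
# and scheduling underneath this ordinary-looking Python call.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite")  # example checkpoint name
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain prefill/decode disaggregation in one paragraph."], params)
print(outputs[0].outputs[0].text)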
 
Newbie question: Is the infrastructure cost of AI partly related to inefficient coding languages like Python or PyTorch? Is efficient coding (which we glimpsed but didn't really understand with DeepSeek) something they do in China but not the USA?
This is actually not a newbie question; it's a very good one.

First, the simple stuff. PyTorch and Python are related only because PyTorch's main programming interface is in Python. The underlying layers of software PyTorch uses to interface with GPUs and TPUs are highly tuned code, from what I can tell usually C or C++. So the notion that using PyTorch implies an inefficient GPU/TPU interface doesn't appear to be accurate.
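
You can see this for yourself with PyTorch's built-in profiler; the kernel names it reports come from the native libraries, not from Python (a minimal sketch, assuming a CUDA GPU is present):

import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    c = a @ b
    torch.cuda.synchronize()

# The GPU-side entries are cuBLAS/CUTLASS GEMM kernels written in C++/CUDA;
# the Python interpreter only issued the launch.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))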

Regarding my negativity about Python as a programming language... I confess that I'm a high-performance software purist, and I don't like programming languages that are interpreted (meaning they are not compiled into processor-specific assembly code) or that run on a "virtual machine" (Java is another example), which translates the language's instructions into run-time subroutines and executes them. Python is also a member of the class of programming languages called "memory safe", in which dynamic memory allocation and deallocation is managed for you by the language's memory manager; these memory managers usually use a single-threaded data structure called a "heap" to manage memory segments and assignments, plus background "garbage collection" to return dynamically deallocated memory areas to the heap. 🤮 Talk about inefficient!

The cloud companies love their memory-safe languages, like Rust (heavily pushed by Microsoft), though at least Rust has an "unsafe" mode for high-performance programming. Google invented a language called Go, which is also memory-safe, though considered less safe than Rust, but Go does appear to have a "secret" unsafe mode too.
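
A quick way to feel the interpreter and memory-manager overhead I'm complaining about, purely as an illustration: sum ten million numbers in a plain Python loop versus one call into compiled C code (NumPy):

import gc
import time
import numpy as np

n = 10_000_000
xs = list(range(n))
arr = np.arange(n, dtype=np.int64)

t0 = time.perf_counter()
total = 0
for x in xs:               # every iteration runs through the bytecode interpreter,
    total += x             # allocating and reference-counting Python int objects
t1 = time.perf_counter()

t2 = time.perf_counter()
total_np = int(arr.sum())  # one call into a compiled C loop
t3 = time.perf_counter()

print(f"interpreted loop: {t1 - t0:.2f}s   numpy sum: {t3 - t2:.4f}s")
print("objects tracked per GC generation:", gc.get_count())  # the background collector at work

On a typical machine the interpreted loop is roughly one to two orders of magnitude slower, which is exactly why the heavy lifting gets pushed down into C/C++/CUDA.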

DeepSeek did a lot of optimizations, not all of which are open for examination, but they did include controlling parallelism in PTX for NVIDIA and minimizing data copies and communication inefficiencies, a classic concern in network drivers. (In the networking world the ultimate in reducing data copies has been available for decades: Remote Direct Memory Access, or RDMA, a capability widely used by NVIDIA in Ethernet and InfiniBand.)
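
To illustrate the "minimize copies, overlap the transfer" idea in miniature (a generic PyTorch sketch, not DeepSeek's PTX-level work): stage host data in pinned memory so the GPU's DMA engine can move it asynchronously while compute proceeds on another stream.

import torch

# Pinned (page-locked) host buffer: the GPU can DMA from it directly,
# avoiding an extra staging copy inside the driver.
host_buf = torch.empty(64, 1024, 1024, pin_memory=True)
device_buf = torch.empty(64, 1024, 1024, device="cuda")

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    device_buf.copy_(host_buf, non_blocking=True)      # asynchronous host-to-device transfer

# ... compute on the default stream can overlap with the copy here ...

torch.cuda.current_stream().wait_stream(copy_stream)   # sync before consuming device_buf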

The reason most teams outside of China do not go as deep as DeepSeek did is that they focus on the models, training, and inference, and depend on higher-performance hardware to get the response times they need. The US restricted China from buying the best AI processors, so they had to get more creative with classic software efficiency tricks. Not to slight the DeepSeek accomplishments in any way, but I think that's all there is to it. As the saying goes, necessity is the mother of invention.
 
which we glimpsed but didn't really understand with DeepSeek
If you are really interested in the details, here are some links on how SGLang, vLLM, and TensorRT-LLM implement some of the DeepSeek optimizations for use by all.



 