
Why’s Nvidia such a beast? It’s that CUDA thing.

Dynamo is open source as well. But more importantly, it offers dynamic reallocation and tuning of resources for maximum throughput or minimum token latency for each model inference instance running across an entire data center, optimizing operations as models go through different phases (prefill, token generation).
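The phase-aware reallocation described above can be sketched in a toy form. This is an illustrative model only, not Dynamo's actual API or policy: it splits a fixed GPU pool between prefill and decode workers in proportion to queue depth, assuming queue depth is a reasonable proxy for demand.

```python
# Toy sketch (not Dynamo's actual scheduler): reallocating a fixed GPU
# pool between prefill and decode workers as the request mix shifts.

def allocate_gpus(total_gpus, prefill_queue, decode_queue):
    """Split GPUs proportionally to queue depth, keeping at least one
    GPU per phase so neither stalls completely."""
    total_work = prefill_queue + decode_queue
    if total_work == 0:
        return total_gpus // 2, total_gpus - total_gpus // 2
    prefill_gpus = round(total_gpus * prefill_queue / total_work)
    prefill_gpus = max(1, min(total_gpus - 1, prefill_gpus))
    return prefill_gpus, total_gpus - prefill_gpus

# Many new requests arriving: shift capacity toward prefill.
print(allocate_gpus(8, prefill_queue=30, decode_queue=10))  # (6, 2)
# Mostly ongoing generations: shift capacity toward decode.
print(allocate_gpus(8, prefill_queue=5, decode_queue=35))   # (1, 7)
```

A real system would also weigh per-phase hardware characteristics (prefill is compute-bound, decode is memory-bandwidth-bound), which is why disaggregating the two phases onto separately tuned pools pays off.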
CUDA is not. You can't run it without CUDA, can you?
 
CUDA is the modern Glide (https://en.m.wikipedia.org/wiki/Glide_(API)). The proprietary, single-hardware-company API never wins out in the end.

I presume we’ll end up with something like an OpenGL or DirectX interface for the calculations required, backed by optimised hardware-specific drivers, once the most common compute patterns are established.

It just does not make sense for the AI companies to tie themselves so strongly to a single hardware vendor: too much lock-in, and nigh on impossible to negotiate on price.
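A vendor-neutral compute interface like the one imagined above might look roughly like this. All names here are hypothetical, purely to illustrate the shape: application code targets one portable interface, and each vendor supplies a tuned backend underneath.

```python
# Hypothetical sketch of an "OpenGL/DirectX for compute": one portable
# interface, vendor-specific backends plugged in underneath.
# All names are illustrative, not a real API.

class ComputeBackend:
    name = "generic"

    def matmul(self, a, b):
        # Portable reference implementation; a real backend would
        # dispatch to a tuned vendor kernel instead.
        return [[sum(x * y for x, y in zip(row, col))
                 for col in zip(*b)] for row in a]

class VendorBackend(ComputeBackend):
    name = "vendor-x"
    # Would override matmul with a hardware-specific driver call.

def run(backend, a, b):
    # Application code only sees the common interface.
    return backend.matmul(a, b)

print(run(VendorBackend(), [[1, 2]], [[3], [4]]))  # [[11]]
```

The point of the design is that swapping `VendorBackend` for another vendor's implementation changes nothing in the application code, exactly the portability OpenGL/DirectX gave graphics.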
 

We already have better open solutions/frameworks for GenAI than OpenGL or DirectX, like PyTorch and TensorFlow. But higher-level GenAI data center operations and orchestration solutions are required. DeepSeek has shown that one can squeeze much more out of GPUs through disaggregation and intelligent resource allocation for transformer models. NVIDIA is making similar tuning available more broadly.
 

"It just does not make sense for the AI companies to tie themselves so strongly to a single hardware vendor, too much lock in and nigh on impossible to negotiate on price."

Ideally, it's correct. However, the problem is that most companies do not have the money or resources to develop, manufacture, support, and market their products for too many environments and standards from the outset. Meanwhile, there may not be other mature or comparable options available for them to choose from.

What can they do? They must deliver products and generate revenue before time runs out. They have no choice but to pick a side.
 
DirectX is supported by multiple hardware vendors, versus a single vendor for CUDA.
 
DirectX isn’t even in the game when it comes to GenAI. My take is that some parts of CUDA, ROCm (AMD), etc. are essentially the assembly code for GenAI. Much of the real innovation is taking place further up the stack, though occasionally optimizations done higher up require new features (instructions) in CUDA (or another low-level GPU management API), and new hardware features (like FP4) need the same.
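To make the FP4 mention concrete, here is a small sketch of rounding values to the 4-bit float format's representable set. The magnitude list follows the common E2M1 encoding (as in the OCP MX formats); real hardware may differ in rounding mode and scaling, so treat this as illustrative.

```python
# Illustrative sketch: quantizing to 4-bit floating point (E2M1).
# One sign bit, two exponent bits, one mantissa bit gives these
# representable magnitudes (per the OCP MX FP4 value set).

FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x):
    """Round x to the nearest representable FP4 (E2M1) value."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(FP4_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return sign * mag

print([quantize_fp4(v) for v in [0.3, 1.7, 2.4, 5.5, -8.0]])
# -> [0.5, 1.5, 2.0, 6.0, -6.0]
```

With only 16 representable values, FP4 needs hardware support for per-block scaling to be usable, which is why it shows up as a new silicon feature rather than a pure software change.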
 
I got a look at Blackwell, impressive. Remember, it is a system and not just a chip. If you haven't already, take a look at Jensen's GTC keynote. It looks like the CUDA moat is getting larger every year.



Masterclass in AI...
 
Remember, it is a system and not just a chip.
My two takeaways from the keynote, especially the part of data center:

* It’s not just a system but a huge data center system: ultra-high-bandwidth GPU-to-GPU networking, water cooling, and power distribution of megawatts are becoming considerations as important as the core GPU chips.
* The software to manage these beasts for efficient inference is even more important. DeepSeek got much of their inference efficiency from fine-tuning/disaggregating their model and fitting it to their specific data center configuration. But NVIDIA’s new “data center GenAI OS” can automate and generalize the same optimizations (and possibly more) for a broad range of NVIDIA-equipped data center configurations. Sounds like they have already tested with DeepSeek and Perplexity models.
 

I agree completely.

But the next time I hear Blackwell is delayed, wafer yield will not be my first guess.
 