
Just how deep is Nvidia's CUDA moat really?

@XYang2023 ,
Intel certainly has basic capabilities with Gaudi, but in Intel's current situation it's not even clear that management is firmly behind Gaudi - they seem to be leaning into Falcon Shores and beyond instead. And the real question is how they deliver solutions to enterprises - this stuff is more like a Heathkit when enterprises prefer to buy ready-to-go TVs.
 
Based on what I’ve read, Falcon Shores appears to be an XPU platform that allows the integration of GPU tiles or accelerators like Gaudi, enabling differentiation across market segments while maintaining a unified API.

I was previously wondering why Intel hadn’t offered a custom accelerator solution like those from Marvell or Broadcom. Intel introduced the XPU concept some time ago and has been using the 'XPU' terminology in its PyTorch work.
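As a concrete illustration of that 'XPU' terminology, here is a minimal sketch, assuming a PyTorch build with Intel XPU support (recent PyTorch releases, or older ones via intel_extension_for_pytorch); it falls back to CPU on any other build:

```python
import torch

# torch.xpu exists on PyTorch builds with Intel GPU (XPU) support;
# the hasattr guard keeps this runnable on builds without it.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

x = torch.randn(1024, 1024, device=device)
y = x @ x  # the same tensor code runs on whichever backend was picked
print(device, y.shape)
```

That is the unified-API idea in practice: device selection is a string, and the rest of the model code stays untouched.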
 
Installed capacity and user bases make CUDA valuable. One example is AWS. I have personally used AWS for AI research, and this is what I experienced.

- AWS has everything you need for internet-based services: different types of server instances, load balancers, URLs, security features, storage images, etc.
- On AWS there are 10 different types of NVIDIA GPU servers (M60 through H200) but only one type of AMD GPU server (G4ad, based on the V520).
- So it's easy to test and migrate software if I stick to CUDA-related software. You can build an AI image on an A100 server, then migrate that image to an H100, or deploy it 100x over with a single mouse click (see the sketch after this list).
- Amazon even supports GPU clustering for NVIDIA GPUs (UltraClusters).
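Here is a minimal sketch of that "same image, different GPU generation" workflow, using boto3. The AMI ID is hypothetical; the instance types are real EC2 families (p4d = A100, p5 = H100):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch(instance_type: str, count: int = 1):
    """Launch `count` instances of one pre-built CUDA-stack image."""
    return ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical AMI with the CUDA stack baked in
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
    )

launch("p4d.24xlarge")            # validate on A100s
launch("p5.48xlarge", count=100)  # then scale the very same image out to H100s
```

The convenience the list above describes is exactly this: between test and scale-out, the only thing that changes is the instance-type string.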

Another thing you have to consider is OS and software compatibility. NVIDIA supports x86, Arm, and PowerPC 64 CPUs across nine different Linux distributions (including each of their major versions) plus Windows, but the AMD MI300X supports only Ubuntu 22.04. So if my data center has software that relies on RHEL, I have no choice but to use NVIDIA GPUs.
So in the end, even if someone successfully compiles a CUDA binary for another GPU (not really hard), there are many more things for businesses to consider.
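On the "CUDA binary on another GPU" point, the framework layer already papers over some of the difference. A minimal sketch, assuming PyTorch, of how one script can tell whether it landed on a CUDA or a ROCm build (ROCm wheels HIPify the torch.cuda.* API surface, so the same calls still work):

```python
import torch

# A CUDA wheel sets torch.version.cuda; a ROCm wheel sets torch.version.hip
# instead, while keeping the torch.cuda.* calls functional on AMD GPUs.
if torch.version.cuda is not None:
    print("CUDA build:", torch.version.cuda)
elif getattr(torch.version, "hip", None) is not None:
    print("ROCm/HIP build:", torch.version.hip)
else:
    print("CPU-only build")

print("GPU visible:", torch.cuda.is_available())
```

The catch is everything below this layer - drivers, supported distros, cluster tooling - which is exactly where the OS-support gap above bites.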
 
I agree that Amazon is very convenient to use. For Intel and AMD to make a difference, they need to collaborate with large CSPs. However, using the cloud is also expensive. In our lab, we run a small cluster of GPUs ranging from 1080s to 3090s.
 
NVDA's moat is very deep in large-scale AI training. It stems primarily from the ecosystem of developer expertise rather than from CUDA itself. While CUDA's performance and usability matter, the true moat lies in the accumulated knowledge of how to handle infrastructure challenges when orchestrating training across massive GPU clusters. This specialized expertise in managing tens of thousands of GPUs simultaneously - dealing with issues like distributed computing, memory management, and system optimization - is much harder for competitors to replicate than the technical aspects of GPU programming interfaces.
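For a sense of where that orchestration expertise begins, here is a minimal single-node sketch, assuming PyTorch DistributedDataParallel over NCCL; everything the post describes (fault tolerance, network topology, memory pressure at scale) sits on top of plumbing like this:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative launch: torchrun --nproc_per_node=8 train.py
dist.init_process_group(backend="nccl")  # NVIDIA's collective-communication library
local_rank = int(os.environ["LOCAL_RANK"])  # set by the torchrun launcher
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
# ... training loop: each rank computes gradients locally and NCCL
# all-reduces them across all GPUs every step ...

dist.destroy_process_group()
```

Scaling this from 8 GPUs to tens of thousands is where the hard-won knowledge lives, and today that knowledge is mostly written against NVIDIA's stack.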

That being said, in AI inference, these challenges don't really exist.

If you believe the AI workload is largely transitioning to inference, then its moat is fading, fast.
Seems like the new o1 model is more focused on inference. In that case, are GPUs still the preferred hardware?
 
Yeah, it's even harder to find competent people like Grove or, in AMD's case, Lisa.
I was thinking about this and how Intel lost its way. Perhaps the Grove culture could only work in an environment of robust, first-rate people who could take the confrontational approach and thrive on it. But at some point the supply of such people must run out (most people aren't made that way) and the thing just doesn't scale any more.
 
You only need a few key people, the leader and his trustworthy lieutenants, but as you said, the supply runs out.
 