
Just how deep is Nvidia's CUDA moat really?

XYang2023

Analysis Nvidia is facing its stiffest competition in years with new accelerators from Intel and AMD that challenge its best chips on memory capacity, performance, and price.

However, it's not enough just to build a competitive part: you also have to have software that can harness all those FLOPS – something Nvidia has spent the better part of two decades building with its CUDA runtime.

Nvidia is well established within the developer community. A lot of codebases have been written and optimized for its specific brand of hardware, while competing frameworks for low-level GPU programming are far less mature. This early momentum is often referred to as "the CUDA moat."
But just how deep is this moat in reality?

 
The Register is mildly entertaining and useful for hearing industry rumors and hearsay. For anything technical, not at all. CUDA and its alternatives are a very technical discussion, as is what happens under the covers when you really want to work in PyTorch on various AI processors and their software stacks. That article is mostly :poop:
 
I ordered a B580 and currently have both the 4090 and 3090. I plan to compare them based on my use case. If the B580 proves sufficient, I may no longer need the NVIDIA cards for certain tasks. As the article mentioned, many people nowadays tend to focus on working at the PyTorch level.
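For anyone wondering what "working at the PyTorch level" means for a comparison like this, here is a minimal sketch, assuming PyTorch 2.5 or newer for the Intel xpu backend: a device-agnostic micro-benchmark that times the same matmul on whichever card is visible (CUDA for the 3090/4090, xpu for the B580). The matrix size and iteration counts are just illustrative.

import time
import torch

def pick_device():
    # NVIDIA path (3090 / 4090)
    if torch.cuda.is_available():
        return torch.device("cuda")
    # Intel GPU path (e.g. Arc B580); the xpu backend ships with PyTorch 2.5+
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")

def sync(device):
    # Kernels launch asynchronously, so synchronize before reading the clock
    if device.type == "cuda":
        torch.cuda.synchronize()
    elif device.type == "xpu":
        torch.xpu.synchronize()

device = pick_device()
dtype = torch.float16 if device.type != "cpu" else torch.float32
x = torch.randn(4096, 4096, device=device, dtype=dtype)

for _ in range(3):          # warm-up so one-time setup cost is not timed
    _ = x @ x
sync(device)

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    _ = x @ x
sync(device)
print(f"{device}, {dtype}: {(time.perf_counter() - t0) / iters * 1e3:.2f} ms per 4096x4096 matmul")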
 
Good for you. Last week Intel enabled direct use of its GPUs in PyTorch, the way Nvidia's work. I hope they continue to mature their software.
 
Did you mean this?


PyTorch should update its download options for Intel GPUs:
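For what it's worth, here is a quick sanity check, assuming a PyTorch build that includes the xpu backend (the API deliberately mirrors torch.cuda), to confirm the card is actually visible before benchmarking anything:

import torch

# Requires a PyTorch build with Intel GPU (xpu) support and the Intel GPU driver installed
if hasattr(torch, "xpu") and torch.xpu.is_available():
    for i in range(torch.xpu.device_count()):
        print(i, torch.xpu.get_device_name(i))
else:
    print("No Intel GPU visible to this PyTorch build")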

 

NVDA's moat is very deep in large-scale AI training. It stems primarily from the ecosystem of developer expertise rather than from CUDA itself. While CUDA's performance and usability matter, the true moat lies in the accumulated knowledge of how to handle infrastructure challenges when orchestrating training across massive GPU clusters. This specialized expertise in managing tens of thousands of GPUs simultaneously, dealing with issues like distributed computing, memory management, and system optimization, is much harder for competitors to replicate than the technical aspects of GPU programming interfaces.

That being said, in AI inference, these challenges don't really exist.

If you believe AI load is largely transitioning to inference, then its moat is fading, fast.
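To put the orchestration point in concrete terms, here is a minimal sketch of the layer everyone starts from in PyTorch: process-group initialization plus DistributedDataParallel over NCCL (the standard torchrun environment variables are assumed). Everything described above as the real moat, such as fault tolerance, topology-aware parallelism, and collective tuning at the scale of tens of thousands of GPUs, sits on top of this thin wrapper.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each worker process
    dist.init_process_group(backend="nccl")   # NCCL on NVIDIA; other vendors supply their own backends
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()   # dummy loss just to exercise the backward pass
        opt.zero_grad()
        loss.backward()                   # gradients are all-reduced across every GPU here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py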
 
I think Intel is also competent in that regard. They are likely waiting for Falcon Shores to address training workloads. For inference, Gaudi is sufficient for now. Additionally, for many use cases, people do not train models from scratch; instead, they fine-tune existing models with their own data. In such instances, you don't need an army of GPUs.
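On the fine-tuning point, here is a rough sketch of what that often looks like in practice: parameter-efficient fine-tuning with LoRA adapters via Hugging Face transformers and peft, which is exactly the kind of workload that fits on one or a few GPUs instead of a cluster. The model name, target modules, and hyperparameters below are placeholders, not a recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "some-org/some-7b-model"   # placeholder checkpoint, not a specific recommendation
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")

# Train only small low-rank adapters; the base weights stay frozen,
# which is why this fits on a single card
cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, cfg)
model.print_trainable_parameters()   # typically well under 1% of the base model

opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
batch = tok(["example fine-tuning text"], return_tensors="pt").to(model.device)
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
opt.step()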
 

Yes, it will mature nicely in time for Celestial and Falcon Shores.
 
I would like to hear more about how the B580 works out for you, if you're able to share. I bought one for my niece as a gaming GPU upgrade (it arrives later this week), but I'm debating getting one for multipurpose use myself.
 
I plan to compare the B580 with my 3090 and 4090 in terms of PyTorch performance.
 
One got Lisa; the other got PSO, BK, Bob, and that was that.

It could have been a very different world if, after CRB, Andy and the BoD had selected Mike Splinter or Pat. It's hard for a non-founder to pivot a goldmine.
Yeah, it's even harder to find competent people like Grove, or in AMD's case, Lisa.
 
I think you pegged it rightly - NVIDIA is going to remain tightly wired in on training, and the real battle is inference. From what I'm seeing, the new HW/SW decision criteria for inference are going to be TCO, tools for building an entire GenAI app/system, plus a minimum level of performance for a particular application. The moat is pretty small when it comes to running and benchmarking off-the-shelf GenAI models, thanks to PyTorch, etc. But bare models, even ones as good as GPT-4o, are pretty useless to enterprises without an entire application system wrapped around them. Guys like Groq and Cerebras are focused on the fastest token rates and super low latency but suffer from high TCO - that's why they have moved to selling hosted services. AWS, Microsoft, and Google are looking at their own chips with their own app development environments, and AMD bought a company to help develop application solutions for GenAI. NVIDIA seems to be focused on creating an enterprise environment with NIM/NeMo, though they just mentioned they might also offer hosted services.
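As a rough illustration of how small that part of the moat is, here is a hedged sketch of an off-the-shelf inference benchmark in plain PyTorch/transformers: load a released checkpoint (the name below is a placeholder) and measure tokens per second on whatever backend is visible. A real TCO comparison would of course add batching, quantization, and the serving stack, which is where the differentiation described above comes in.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "some-org/some-open-model"   # placeholder checkpoint
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = "xpu"
else:
    device = "cpu"

tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to(device)

inputs = tok("Explain the CUDA moat in one paragraph.", return_tensors="pt").to(device)
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - t0
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{device}: {new_tokens / elapsed:.1f} tokens/s")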
 