
Just how deep is Nvidia's CUDA moat really?

XYang2023

Analysis Nvidia is facing its stiffest competition in years with new accelerators from Intel and AMD that challenge its best chips on memory capacity, performance, and price.

However, it's not enough just to build a competitive part: you also have to have software that can harness all those FLOPS – something Nvidia has spent the better part of two decades building with its CUDA runtime.

Nvidia is well established within the developer community. A lot of codebases have been written and optimized for its specific brand of hardware, while competing frameworks for low-level GPU programming are far less mature. This early momentum is often referred to as "the CUDA moat."
But just how deep is this moat in reality?

 
The Register is mildly entertaining and useful for hearing industry rumors and hearsay. For anything technical, not at all. CUDA and its alternatives are a very technical discussion, as is what happens under the covers when you really want to work in PyTorch on various AI processors and their software stacks. That article is mostly :poop:
 
I ordered a B580 and currently have both the 4090 and 3090. I plan to compare them based on my use case. If the B580 proves sufficient, I may no longer need the NVIDIA cards for certain tasks. As the article mentioned, many people nowadays tend to focus on working at the PyTorch level.
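For anyone wondering what "working at the PyTorch level" means for a comparison like this, here is a minimal sketch, assuming PyTorch 2.5 or newer for the Intel xpu backend: a device-agnostic micro-benchmark that times the same matmul on whichever card is visible (CUDA for the 3090/4090, xpu for the B580). The matrix size and iteration counts are just illustrative.

import time
import torch

def pick_device():
    # NVIDIA path (3090 / 4090)
    if torch.cuda.is_available():
        return torch.device("cuda")
    # Intel GPU path (e.g. Arc B580); the xpu backend ships with PyTorch 2.5+
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")

def sync(device):
    # Kernels launch asynchronously, so synchronize before reading the clock
    if device.type == "cuda":
        torch.cuda.synchronize()
    elif device.type == "xpu":
        torch.xpu.synchronize()

device = pick_device()
dtype = torch.float16 if device.type != "cpu" else torch.float32
x = torch.randn(4096, 4096, device=device, dtype=dtype)

for _ in range(3):          # warm-up so one-time setup cost is not timed
    _ = x @ x
sync(device)

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    _ = x @ x
sync(device)
print(f"{device}, {dtype}: {(time.perf_counter() - t0) / iters * 1e3:.2f} ms per 4096x4096 matmul")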
 
Good for you. Last week Intel enabled direct use of its GPUs in PyTorch, the way Nvidia's work. I hope they continue to mature their software.
 
Did you mean this?


PyTorch should update its download options for Intel GPUs:
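For what it's worth, here is a quick sanity check, assuming a PyTorch build that includes the xpu backend (the API deliberately mirrors torch.cuda), to confirm the card is actually visible before benchmarking anything:

import torch

# Requires a PyTorch build with Intel GPU (xpu) support and the Intel GPU driver installed
if hasattr(torch, "xpu") and torch.xpu.is_available():
    for i in range(torch.xpu.device_count()):
        print(i, torch.xpu.get_device_name(i))
else:
    print("No Intel GPU visible to this PyTorch build")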

 

NVDA's moat is very deep in large-scale AI training. It stems primarily from the ecosystem of developer expertise rather than from CUDA itself. While CUDA's performance and usability matter, the true moat lies in the accumulated knowledge of how to handle infrastructure challenges when orchestrating training across massive GPU clusters. This specialized expertise in managing tens of thousands of GPUs simultaneously, dealing with issues like distributed computing, memory management, and system optimization, is much harder for competitors to replicate than the technical aspects of GPU programming interfaces.

That being said, in AI inference, these challenges don't really exist.

If you believe AI load is largely transitioning to inference, then its moat is fading, fast.
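To put the orchestration point in concrete terms, here is a minimal sketch of the layer everyone starts from in PyTorch: process-group initialization plus DistributedDataParallel over NCCL (the standard torchrun environment variables are assumed). Everything described above as the real moat, such as fault tolerance, topology-aware parallelism, and collective tuning at the scale of tens of thousands of GPUs, sits on top of this thin wrapper.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each worker process
    dist.init_process_group(backend="nccl")   # NCCL on NVIDIA; other vendors supply their own backends
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()   # dummy loss just to exercise the backward pass
        opt.zero_grad()
        loss.backward()                   # gradients are all-reduced across every GPU here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py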
 
I think Intel is also competent in that regard. They are likely waiting for Falcon Shores to address training workloads. For inference, Gaudi is sufficient for now. Additionally, for many use cases, people do not train models from scratch; instead, they fine-tune existing models with their own data. In such instances, you don't need an army of GPUs.
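On the fine-tuning point, here is a rough sketch of what that often looks like in practice: parameter-efficient fine-tuning with LoRA adapters via Hugging Face transformers and peft, which is exactly the kind of workload that fits on one or a few GPUs instead of a cluster. The model name, target modules, and hyperparameters below are placeholders, not a recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "some-org/some-7b-model"   # placeholder checkpoint, not a specific recommendation
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")

# Train only small low-rank adapters; the base weights stay frozen,
# which is why this fits on a single card
cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, cfg)
model.print_trainable_parameters()   # typically well under 1% of the base model

opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
batch = tok(["example fine-tuning text"], return_tensors="pt").to(model.device)
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
opt.step()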
 

Yes, it will mature nicely in time for Celestial and Falcon Shores.
 
I would like to hear more about how the B580 works out for you, if you're able to share. I bought one for my niece as a gaming GPU upgrade (it arrives later this week), but I'm debating getting one for multipurpose use myself.
 
I plan to compare the B580 with my 3090 and 4090 in terms of PyTorch performance.
 
One got Lisa; the other got PSO, BK, Bob, and that was that.

It could have been a very different world if, after CRB, Andy and the BoD had selected Mike Splinter or Pat. It's hard for a non-founder to pivot a goldmine.
Yeah, it's even harder to find competent people like Grove, or in AMD's case, Lisa.
 
I think you pegged it rightly - NVIDIA is going to remain tightly wired in on training, and the real battle is inference. From what I'm seeing, the new HW/SW decision criteria for inference are going to be TCO, tools for building an entire GenAI app/system, plus a minimum level of performance for a particular application. The moat is pretty small when it comes to running and benchmarking off-the-shelf GenAI models, thanks to PyTorch, etc. But bare models, even ones as good as GPT-4o, are pretty useless to enterprises without an entire application system wrapped around them. Guys like Groq and Cerebras are focused on the fastest token rates and super low latency but suffer from high TCO - that's why they have moved to selling hosted services. AWS, Microsoft, and Google are looking at their own chips with their own app development environments, and AMD bought a company to help develop application solutions for GenAI. NVIDIA seems to be focused on creating an enterprise environment with NIM/NeMo, though they just mentioned they might also offer hosted services.
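As a rough illustration of how small that part of the moat is, here is a hedged sketch of an off-the-shelf inference benchmark in plain PyTorch/transformers: load a released checkpoint (the name below is a placeholder) and measure tokens per second on whatever backend is visible. A real TCO comparison would of course add batching, quantization, and the serving stack, which is where the differentiation described above comes in.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "some-org/some-open-model"   # placeholder checkpoint
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = "xpu"
else:
    device = "cpu"

tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to(device)

inputs = tok("Explain the CUDA moat in one paragraph.", return_tensors="pt").to(device)
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - t0
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{device}: {new_tokens / elapsed:.1f} tokens/s")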
 