Why’s Nvidia such a beast? It’s that CUDA thing.

Daniel Nenni · Nov 4, 2024

Nvidia's CUDA core processors are designed to take advantage of parallel processing to speed up compute-intensive applications such as AI, 3D rendering, gaming simulation and more. (Nvidia)

Nvidia is arguably the top contender when it comes to high-performance computing, gaming and AI processing, far surpassing AMD and Intel for the performance crown. But what sets Nvidia's chips apart from other manufacturers that others haven't been able to replicate? That success lies with parallel processing – a process that's used to increase the computational speed of computer systems by performing multiple data-processing operations simultaneously.

Unlike CPUs (Central Processing Units) that have multiple cores that process tasks sequentially, GPUs (Graphics Processing Units) can utilize thousands of cores to handle multiple tasks at the same time. That architectural advantage provides the leverage needed to handle today's AI algorithms, which require massive amounts of data processing. To put that into perspective, imagine one person trying to build a skyscraper versus thousands working together.

Enter CUDA

Looking at AI looming on the horizon, Nvidia saw a need for a robust software environment that could benefit from the company's powerful hardware, and CUDA was the result. The programming language, first introduced in 2006, allows developers to take advantage of parallel processing capabilities for demanding AI applications. (CUDA stands for Compute Unified Device Architecture.)

Nvidia's move not only opened the door to new possibilities but also laid the groundwork for a CUDA ecosystem, ushering the company to the top of the GPU food chain. Its flagship AI GPUs, combined with the company's CUDA software, led to such a head start on the competition that switching to an alternative now seems almost unthinkable for many large organizations. So, what does CUDA have to offer? Here’s a look at some core features:

Massive Parallelism: The CUDA architecture is designed to leverage thousands of CUDA cores, allowing the execution of many threads, making it ideal for tasks such as image rendering, scientific calculations, machine learning, computer vision, big data processing and more. CUDA core processors are hardware that act as a small processing unit within an Nvidia GPU and function as a mini-CPU to handle thousands of threads simultaneously.
Hierarchical Thread Organization: CUDA organizes threads into blocks and grids, making it easier to manage and optimize parallel execution and processing, allowing developers to take advantage of hardware resources.
Dynamic Parallelism: This allows kernels (functions executed on the GPU) to launch additional kernels to enable more flexible and dynamic programming models and simplify code for recursive algorithms or adaptive workloads.
Unified Memory: Nvidia's unified memory facilitates the sharing of information between the GPU and CPU, simplifying memory management and enhancing performance by migrating to the appropriate memory space.
Shared Memory: Each block of threads has access to shared memory, allowing for faster data exchanges among threads compared to global memory (logical space), which improves performance.
Optimized Libraries: The CUDA software comes with a suite of optimized libraries to increase performance, including cuBLAS for linear algebra, cuDNN for deep learning, Thrust for parallel algorithms and more.
Error Handling/Compiler Support: CUDA offers built-in error-handling capabilities that diagnose issues during the development phase to improve efficiency. It also features compiler support for developers to create code using familiar syntax, making it easy to inject GPU computing into existing applications.

While AMD and Intel are also developing AI chips, Nvidia's lead and comprehensive approach have positioned it as the undisputed leader in the AI boom, which is reflected in the company's market valuation, which beats the other companies combined. AMD currently has a market value of $315 billion, which towers over Intel's $185 billion; however, both are significantly dwarfed by Nvidia's astounding $2 trillion market cap.

CUDA Applications

Since its introduction in 2006, CUDA has been widely deployed for thousands of applications and research papers and is supported by a base of over 500 million GPUs found in PCs, notebooks, laptops, workstations, data centers and even supercomputers. CUDA cores have been tapped for astronomy, biology, chemistry, physics, data mining, manufacturing, finance, and other computationally intense fields; however, AI has quickly taken the application crown.

Nvidia's CUDA cores are indispensable for training and deploying neural networks and deep learning models, taking advantage of their parallel processing capabilities. To put that into perspective, a dozen Nvidia H100 GPUs can provide the same deep learning equivalent as 2,000 midrange CPUs. That enhanced performance is ideal for complex tasks such as image and speech recognition. Natural Language Processing (NLP) and Large Language Models (LLMs), such as GPT, also benefit from CUDA core processing, making it easier for developers to deploy sophisticated algorithms or enhance applications like chatbots, translation services, and text analysis.

Deep Genomics' BigRNA can accurately predict diseases based on patients' genetic variations.

Nvidia's CUDA technology has also been tasked with healthcare applications, including facilitating faster and more accurate diagnostics via deep learning algorithms. They drive molecular-scale simulations to visualize organs and help predict the effectiveness of treatments. They're also used to analyze complex data from MRIs and CT scans, improving early detection of diseases. Toronto-based Deep Genomics is utilizing CUDA to power deep learning to better understand how genetic variations can lead to disease and how best to treat them with the discovery of new medicines. Tempus is another medical company that's using Nvidia's GPUs for deep learning to speed the analysis of medical images, and its technology is set to be deployed in GE Healthcare MRI machines to help diagnose heart disease.

CUDA core technology has applications in the finance industry, which uses Nvidia GPUs to process large amounts of transaction data, providing banks with real-time fraud detection and risk management. AI algorithms can analyze complex financial patterns, improving market prediction accuracy and investment strategies. The same is true for stock brokerages, who task AI algorithms to execute orders in milliseconds, optimizing financial returns.

Academia has utilized CUDA technology as well, combining it with OpenCL APIs to develop and optimize AI algorithms for new drug discoveries, making GPUs integral to their studies. Institutions such as Stanford University have been using CUDA technology since its release and have used the platform as a base for learning how to program AI algorithms and deep learning models.

...to reduce qubit count needed for studying large data sets

Researchers at Stanford University used CUDA to develop and accelerate the simulation of new QML methods to reduce the qubit count needed to study large data sets.

Researchers at the University of Edinburgh's Quantum Software Lab have also utilized the technology to develop and simulate new QML methods to significantly reduce the qubit count necessary to study large data sets. Using CUDA-Q simulation toolkits and Nvidia GPUs, they were able to overcome scalability issues and simulate complex QML clustering methods on problem sizes up to 25 qubits. The breakthrough is essential for developing quantum-accelerated supercomputing applications.

Retail companies have also jumped on the AI bandwagon, using it to enhance customers' experiences via personalized recommendations and inventory management. Generative AI models take advantage of data science to predict consumer behavior and tailor marketing strategies. Lowe's, for example, uses GPU-accelerated AI for several applications, including supply chain optimization and dynamic pricing models. CUDA technology helps analyze large datasets quickly, which improves demand forecasting and ensures efficient stock replenishment. The company recently partnered with Nvidia to develop computer vision applications, including enhancing self-checkouts to prevent theft or determine if a product was left unintentionally in carts in real-time.

Nvidia’s lead comes back to CUDA

It's easy to see why Nvidia's CUDA technology has propelled the company to the number one spot of high-performance computing by unlocking the full potential of parallel processing with its CUDA architecture. The ability to harness thousands of cores for processing large amounts of data has made the technology a valuable platform for many industries, from healthcare and academia to retail and finance. With its extensive CUDA ecosystem, optimized libraries and hardware innovations, Nvidia has cemented its leadership in the AI boom far above AMD and Intel. As AI applications continue to advance, CUDA looks to remain the gold standard for researchers and developers to push the limits of what's possible.

Grace is a Chicago-based engineer.

https://www.fierceelectronics.com/ai/whys-nvidia-such-beast-its-cuda-thing

samwilde · Nov 4, 2024

At this point you have loads of CUDA software developed. It is kind of like X86 in that it has a huge application library available that other architectures lack. As simple as that.

Daniel Nenni · Nov 5, 2024

samwilde said:
At this point you have loads of CUDA software developed. It is kind of like X86 in that it has a huge application library available that other architectures lack. As simple as that.

Yes, it is all about the ecosystem which cannot be cut and pasted to a competitor's GPU. At some point it may fall like Intel x86 but certainly not today.

KevinK · Nov 5, 2024

Daniel Nenni said:
At some point it may fall like Intel x86 but certainly not today.

x86's "fall" is much more about the rise of other compute paradigms / architectures / ecosystems that are far better suited for new applications (ARM for low power client devices and IoT, GPUs/ASICs for graphics, HPC, machine learning and generative AI). PC unit sales have ranged between 240M and 360M per year for at least the last 25 years. The problem for Intel is that the percentage of volume (wafers) of leading edge silicon that is not PC or x86 server has gone from near zero to something like 70% between smartphones and AI.

hist78 · Nov 5, 2024

KevinK said:
x86's "fall" is much more about the rise of other compute paradigms / architectures / ecosystems that are far better suited for new applications (ARM for low power client devices and IoT, GPUs/ASICs for graphics, HPC, machine learning and generative AI). PC unit sales have ranged between 240M and 360M per year for at least the last 25 years. The problem for Intel is that the percentage of volume (wafers) of leading edge silicon that is not PC or x86 server has gone from near zero to something like 70% between smartphones and AI.

Across the globe, 1.17 billion units of smartphones (234.6 million units were from Apple) were sold in 2023 vs 241.89 million of PCs for the same year. Intel missed the whole smartphone revolution and the huge market.

KevinK · Nov 5, 2024

hist78 said:
Across the globe, 1.17 billion units of smartphones (234.6 million units are from Apple) were sold in 2023 vs 241.89 million of PCs for the same year. Intel missed the whole smartphone revolution and the huge market.

Intel did not respond well to the same Innovator's Dilemma that they used to triumph over minicomputers via the PC, and servers over the likes of Sun and IBM.

The Innovator's Dilemma - Wikipedia

en.wikipedia.org

KevinK · Nov 8, 2024

CUDA's not such a leg up in Gen AI anymore since all the LLM models are built on top of frameworks like PyTorch that make things more portable. But NVIDIA has dozens of other app-specific CUDA libraries for areas outside LLMs that entrench NVIDIA in those market segments, providing steady revenue growth in those segments. The biggest moat for NVIDIA in the LLM / GenAI space is all the primary model development that takes place on NVIDIA HW/SW, plus their new GenAI NEMO/NIMs/Triton software for developing, training and deploying complex LLM based systems (not just the models themselves).

Xebec · Nov 8, 2024

Not discounting CUDA, but Nvidia is also in a position where it's products have significant perf/watt advantages over competitive products. That allows them to stay distanced from the likes of AMD and Intel. Nvidia can win a price war simply by being that much better hardware wise right now.

(Similarly -- AMD's big coup in x86 server really came from perf/watt advantages, not in absolute performance terms.. though Nvidia has BOTH in GPU/AI compute..)

siliconbruh999 · Nov 8, 2024

Xebec said:
Not discounting CUDA, but Nvidia is also in a position where it's products have significant perf/watt advantages over competitive products. That allows them to stay distanced from the likes of AMD and Intel. Nvidia can win a price war simply by being that much better hardware wise right now.

(Similarly -- AMD's big coup in x86 server really came from perf/watt advantages, not in absolute performance terms.. though Nvidia has BOTH in GPU/AI compute..)

There is also doesn't have a second source issue like AMD/Intel has i.e. you can't second source the hardware with the same sw stack

XYang2023 · Nov 8, 2024

KevinK said:
CUDA's not such a leg up in Gen AI anymore since all the LLM models are built on top of frameworks like PyTorch that make things more portable. But NVIDIA has dozens of other app-specific CUDA libraries for areas outside LLMs that entrench NVIDIA in those market segments, providing steady revenue growth in those segments. The biggest moat for NVIDIA in the LLM / GenAI space is all the primary model development that takes place on NVIDIA HW/SW, plus their new GenAI NEMO/NIMs/Triton software for developing, training and deploying complex LLM based systems (not just the models themselves).

It takes time for alternative ecosystem to catchup. I think developers could also use LLMs (code gen) to accelerate this effort. At present, 25% of code used in Google is generated by AI.

Portable SYCL code using oneMKL on AMD, Intel and Nvidia GPUs UXL Q324 - oneAPI

oneapi.io

oneAPI: A Viable Alternative To CUDA* Lock-in

oneAPI programming model - an alternative to CUDA* vendor lock-in for accelerated parallel computing across HPC, AI, and more on CPUs and GPUs.

www.intel.com

Intel CTO suggests porting CUDA code to Intel silicon

This is about ending Nvidia's vendor lock-in, insists Greg Lavender

www.theregister.com

To my knowledge, Triton is developed by OpenAI:

https://openai.com/index/triton/

KevinK · Nov 9, 2024

XYang2023 said:
Triton is developed by OpenAI:

Correct, but NVIDIA offers a super-optimized open-source Triton inference server, that was likely developed in concert with OpenAI.

NVIDIA Dynamo

Deploy, run, and scale AI for any application on any platform.

www.nvidia.com

XYang2023 · Nov 9, 2024

KevinK said:
Correct, but NVIDIA offers a super-optimized open-source Triton inference server, that was likely developed in concert with OpenAI.

NVIDIA Dynamo

Deploy, run, and scale AI for any application on any platform.

www.nvidia.com

https://www.jokeren.tech/slides/triton_intel.pdf

It is also running on Lunar Lake.

GitHub - intel/intel-xpu-backend-for-triton: OpenAI Triton backend for Intel® GPUs

OpenAI Triton backend for Intel® GPUs. Contribute to intel/intel-xpu-backend-for-triton development by creating an account on GitHub.

github.com

When Falcon Shores is available, the software support should be better.

KevinK · Nov 9, 2024

XYang2023 said:
https://www.jokeren.tech/slides/triton_intel.pdf

Great foil set from Intel folks explaining Triton. But it also highlights why NVIDIA’s Triton inference server is a big advantage today. Buried in there, they highlight:
* NVIDIA’s debugging tools as a possible area for future tooling
* NVIDIA’s implementation as the fastest today.

I’m sure Intel offers a working implement for Lunar Lake, and an even better implementation for Falcon Shores. But NVIDIA’s advantage comes from:
* Completeness, including debugging and deployment tools
* Speed
* Being the development platform and “primary port” for virtually all new models.

XYang2023 · Nov 9, 2024

KevinK said:
Great foil set from Intel folks explaining Triton. But it also highlights why NVIDIA’s Triton inference server is a big advantage today. Buried in there, they highlight:
* NVIDIA’s debugging tools as a possible area for future tooling
* NVIDIA’s implementation as the fastest today.

I’m sure Intel offers a working implement for Lunar Lake, and an even better implementation for Falcon Shores. But NVIDIA’s advantage comes from:
* Completeness, including debugging and deployment tools
* Speed
* Being the development platform and “primary port” for virtually all new models.

Nvidia's position in the market, particularly in training, is undoubtedly very strong. Let's see how oneAPI and other competitors perform in 2025. Additionally, oneAPI is not solely supported by Intel; other members of the UXL consortium, such as ARM, Qualcomm, and Broadcom, also provide support.

KevinK · Nov 9, 2024

XYang2023 said:
Let's see how oneAPI and other competitors perform in 2025.

Agreed - we’ll see. Just a note - oneAPI is more of a general purpose acceleration framework like CUDA, but doesn’t include the hundreds of app-specific libraries written for CUDA. Intel has about the same number of software developers as NVIDIA, 15,000, but are spread across many more architectures and market needs, so they are spread thin.

Tanj · Nov 9, 2024

samwilde said:
At this point you have loads of CUDA software developed. It is kind of like X86 in that it has a huge application library available that other architectures lack. As simple as that.

Not quite. It is not binary compatibility. Nvidia builds new compilation for each generation, and apps are recompiled or even retuned for each generation.
The beauty or ugliness of that is just how difficult it is to get a GPU running at full speed. The computation elements are tiny compared to the size of the tensors being manipulated, so in a sense the top level mathematics is being shredded by CUDA and then woven together to run smoothly on those small fast elements is a magical art.
In principle a competing but different GPU can start from CUDA and compile competitively - after all, Nvidia is doing that partially overlapped and partly new every year. So for competition like AMD or Intel with their own GPU-style solutions to compete they need to assemble a compiler team with competitive talents. It is a vertical business..

XYang2023 · Nov 9, 2024

KevinK said:
Agreed - we’ll see. Just a note - oneAPI is more of a general purpose acceleration framework like CUDA, but doesn’t include the hundreds of app-specific libraries written for CUDA. Intel has about the same number of software developers as NVIDIA, 15,000, but are spread across many more architectures and market needs, so they are spread thin.

According to recent news about Intel's restructuring, the company announced plans to reassign software assets to various business divisions. I assume DCAI will receive increased software support.

"And the software and incubation businesses—I assume including quantum and neuromorphic R&D—are being integrated in these core business units to streamline things."

Intel Makes A Slew Of Announcements Supporting Its Turnaround Strategy

Intel's big announcements this week are positive indicators for its turnaround efforts and overall strategy, assuming it can execute the announced moves flawlessly.

www.forbes.com

KevinK · Mar 19, 2025

KevinK said:
Correct, but NVIDIA offers a super-optimized open-source Triton inference server, that was likely developed in concert with OpenAI.

Seems like NVIDIA has lapped other AI hardware companies on the software side again - this time in inference serving, with respect to speed, latency and flexible configuration and deployment.

NVIDIA Dynamo

Deploy, run, and scale AI for any application on any platform.

www.nvidia.com

siliconbruh999 · Mar 19, 2025

KevinK said:
Agreed - we’ll see. Just a note - oneAPI is more of a general purpose acceleration framework like CUDA, but doesn’t include the hundreds of app-specific libraries written for CUDA. Intel has about the same number of software developers as NVIDIA, 15,000, but are spread across many more architectures and market needs, so they are spread thin.

OneAPI is fully Open Source though so this may not be entirely correct there maybe more person working on OneAPI and Intel has many open source Libraries available as well

KevinK · Mar 19, 2025

siliconbruh999 said:
OneAPI is fully Open Source though so this may not be entirely correct there maybe more person working on OneAPI and Intel has many open source Libraries available as well

Dynamo is open source as well. But more importantly, it offers dynamic reallocation and tuning of resources for max throughout or min token latency for each model inference instance running in an entire data center to optimize operations as models go through different phases (pre fill, token generation).

Why’s Nvidia such a beast? It’s that CUDA thing.

Admin

Enter CUDA​

CUDA Applications​

Nvidia’s lead comes back to CUDA​

Active member

Admin

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Well-known member

Enter CUDA

CUDA Applications

Nvidia’s lead comes back to CUDA