Can AMD close the gap? Will Nvidia rule all in AI/ML?

I recently wrote a memo about this and posted it on LinkedIn:
(I'd also welcome feedback telling me where I'm wrong.)

To the question "Can AMD close the gap?", I think the answer is resoundingly no. The gap is actually a software gap, and AMD won't be the one to close it. Some journalists have been touting PyTorch as the competitor to CUDA, but PyTorch was actually developed by Meta and runs on CUDA. So I think you'll have to see Meta, OpenAI, or the rest of big tech create a first-class machine learning library that gains traction before AMD becomes a viable competitor here.
I remember reading your article on LinkedIn. While I agree that CUDA is an important differentiator for Nvidia, there was one quip you made at the end that made me wonder about you:

Just as you would build a financial model in Excel, you would build your new, clever AI chatbot using PyTorch (or Triton). PyTorch relies on either CUDA or ROCm and similarly, Excel relies on an operating system, either Windows or MacOS. The catch, as there always is, is that PyTorch on ROCm is limited in just the same way that Excel on MacOS is annoyingly inferior to Excel on Windows.
How exactly does MacOS limit the development or execution of Excel as compared to Windows?

As for your assertions about ROCm, I'm still studying what AMD has posted, so I don't feel qualified to make any comments yet.
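For anyone comparing the two stacks, a quick way to see which backend a given PyTorch install was built against is sketched below; this is a minimal example using standard PyTorch version attributes, not anything taken from AMD's ROCm documentation:

import torch

# A CUDA wheel reports a CUDA version and no HIP version; a ROCm wheel is the reverse.
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)      # e.g. "12.1" on an NVIDIA build, None otherwise
print("ROCm/HIP build:", torch.version.hip)   # e.g. "5.7" on an AMD build, None otherwise
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))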
 
Ah, Microsoft limits the functionality of Excel on the Mac. Not all keyboard shortcuts or external data imports work, and it doesn't allow VBA programming, although I don't make use of that anyway. I was betting that more of my readers were Excel power users doing financial modeling. Clearly that analogy fell flat!
 
If it makes you feel any better, that poor analogy made me remember your article, and I probably would not have otherwise. ;)
 
Man, I like your article, and I get the Excel thing. But the financial industry has evolved a lot in the past few years, and we don't build complicated models in Excel that often anymore. We are trying to be as smart as the tech industry, writing code and using machine learning models as much as possible. Many analysts still use Excel for relatively simple models, though. I especially appreciate your emphasis on CUDA, which NVDA rolled out in 2006(?). One of my friends was an early adopter, and he finished a whole year's worth of work in 15 days back in 2008. I think AI belongs to NVDA for general and more common usages. AMD could make a dent in supercomputing because many HPC people will write customized code anyway.
 
PyTorch uses CUDA to target NVDA, but since PyTorch 2.0 it has also been able to target ROCm for AMD (as can TensorFlow and ONNX). The next version of ROCm, tuned for the MI300, is expected later this year, at which time we will see meaningful benchmarks for the new AMD machine. The raw performance and memory-bandwidth numbers give reason to believe it will be close.
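That portability shows up at the code level: a ROCm build of PyTorch exposes AMD GPUs through the same torch.cuda namespace (via HIP), so the usual device-agnostic pattern runs unchanged on either vendor. A minimal sketch, with an arbitrary placeholder model and tensor sizes:

import torch

# "cuda" here means "whichever GPU backend this wheel was built with":
# an NVIDIA GPU on a CUDA build, an AMD GPU on a ROCm build.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(64, 1024, device=device)
y = model(x)                      # runs on whichever GPU (or CPU) was selected
print(y.shape, y.device)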
 
The latest versions on Windows have discarded some old shortcuts (I suspect the current team has no personal experience with most of them), and VBA is off by default for security reasons. Enabling it is not obvious. A quick search indicates the same method to enable it is used on the Mac.
 
@blueone and @jwall , great assessments, plus follow-up. One component of the assessment that seems to be missing is comparative strength and market fit across training, inference in the data center, and inference at the edge. I think one of NVIDIA's big strengths is that they tackle the whole AI stack (multiple hardware architectures, training, inference in the data center, inference at the edge) via their range of hardware and CUDA. Many chip-centric solutions are targeted at only one of these market points. Cerebras is focused on training; Esperanto, Pascaline, and others are focused on inference; I think Amazon and Google are focused on data center training and inference, but not edge inference. I wonder if we might end up with a world where different companies own different use models: one set for super-high-speed, high-capacity training, some for integrated training/inference in the data center, and others for edge inference, because each of those might have far different market requirements.
 
Good points, especially about edge computing. Thanks for reminding us about Esperanto, which has a strategy that sounds a bit like Tenstorrent at a high level.

A friend reminded me, after reading my original reply, that Microsoft is working on an Athena chip to compete with GPUs, but I haven't heard anything from them recently.

There's a lot of money being spent on AI chips. Reminds me a bit of microprocessors 25 years ago. Not too many microprocessors survived.
 
That's kind of what got me thinking about the different use cases for AI chips. There used to be plenty of different microprocessor and microcontroller architectures, but the market eventually whittled them down to just a few that addressed all of the then-existing market sweet spots (HPC, server/cloud, client PC, mobile client, microcontroller, real-time processor, DSP). AI has created an explosion of new "processor" lifeforms.
 
True edge computing has to deal with model size for updates: not just the size running through the phone or tablet's IPU, but also how you download gigabytes to 100M phones, how many years you wait until phones that can run the model roll out, and the possibility that each year's hot AI model can't be run by phones more than two years old. Ten years ago that worked, but now phones have evolved into expensive beasts that users may keep for five years or more.
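To put rough numbers on that distribution problem, here is a back-of-envelope sketch; the model size and quantization are illustrative assumptions, and only the 100M-phone figure comes from the paragraph above:

# Rough cost of pushing one model update to a large phone fleet.
params        = 7e9      # assume a 7B-parameter on-device model
bytes_per_wt  = 0.5      # assume 4-bit quantized weights
phones        = 100e6    # the "100M phones" mentioned above

gb_per_phone = params * bytes_per_wt / 1e9    # ~3.5 GB per download
total_pb     = gb_per_phone * phones / 1e6    # ~350 PB of aggregate traffic per update
print(f"{gb_per_phone:.1f} GB per phone, {total_pb:.0f} PB total per update")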

It may make more sense to run AI at the cell towers, or at thousands of CoLo hotels located near customers.

I'm not so worried about the variety of inferencing. The compute for inference is settling down, though some of the sparsity tricks may need licensing. Hard to tell since there are so many ways to compress models. It is more interesting to wonder what kind of memory is going to win. I think it will not be DRAM, allowing a few years for alternatives to get to market. Memory is dominating the energy and throughput problems at the edge.
 
Any thoughts from anyone on quantum computing competing as an AI service, since it would have to operate in a data center format?
 
Completely off the map. Recent AI advances have been due to phase changes occurring with the size of the model and the amount of knowledge compressed into it. Quantum computing has no memory capacity to speak of.
 
Thanks. Why are some making a big deal out of quantum computing if it has no memory? Will it ever be able to use external memory, or would that defeat the speed and purpose?
 
Well, AI is not the only reason to do computing. I gave some use cases in your other thread on QC.
 