
Can AMD close the gap? Will Nvidia rule all in AI/ML?

Arthur Hanson

Well-known member
Is it possible for AMD to close the gap with Nvidia, or at least carve out an area of the market for themselves? Is there any sign of anyone competing with Nvidia in the near future? Has Intel become the past, and do they have any area where they can dominate, or is Intel slated for a downhill run against Samsung and TSM? Any thoughts or comments appreciated. Thanks
 
There's a lot more going on in AI chip development than Nvidia and AMD. GPUs happen to be useful for deep neural networks such as transformers, which are the architecture behind large language model AI software. (You can do a search for all of these terms and get explanations for what they mean, but keep in mind that understanding the explanations does require some knowledge of moderately advanced computer science concepts.) GPUs are designed around a general computing model called Single Instruction Multiple Data (SIMD), which means that a single instruction can be executed simultaneously, in parallel, on multiple data streams. GPUs consist of hundreds or thousands of arithmetic-logic units (ALUs), each of which executes a simpler instruction set than a general-purpose CPU, but there is sufficient generality to allow GPUs to be applied to different problem spaces, such as graphics, database processing, image processing... there's actually a rather long list of applications, and AI is just the currently hot addition.
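To make the SIMD idea concrete, here's a minimal sketch in Python with NumPy (my own illustration, not anything vendor-specific): the vectorized form expresses one logical operation applied across a whole array, which is the model a GPU scales up to thousands of ALUs.

import numpy as np

# Scalar loop: one multiply-add at a time, roughly what a single ALU does.
def saxpy_loop(a, x, y):
    out = np.empty_like(y)
    for i in range(len(y)):
        out[i] = a * x[i] + y[i]
    return out

# SIMD-style: one expression applied to every element "at once".
# NumPy dispatches this to vectorized native code; a GPU would spread the
# same data-parallel work across thousands of ALUs.
def saxpy_vectorized(a, x, y):
    return a * x + y

x = np.random.rand(1_000_000).astype(np.float32)
y = np.random.rand(1_000_000).astype(np.float32)
assert np.allclose(saxpy_loop(2.0, x, y), saxpy_vectorized(2.0, x, y))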

Nvidia's GPU development is, IMO, way ahead of anything AMD or Intel has in that product category. Nvidia has the best GPUs close-coupled with many-core Arm-based server CPUs, and Nvidia has far and away the best system interconnect strategy of the three companies with NVLink and InfiniBand. AMD and Intel, at best, can only apply their coherent CPU interconnects and CXL, which are very limited by comparison. Even CXL, though Intel conceived it as an accelerator-to-CPU interconnect, has mostly been repositioned to support remote shared memory pools, and does not seem focused on the CPU-accelerator problem. And on top of all of these advantages, Nvidia's CUDA ecosystem is years ahead of the open software competition (PyTorch).

But there are other strategies for AI processing that may be better than Nvidia's in the long run.

Google, for several years now, has been using custom ASICs called TPUs (Tensor Processing Units), specialized chips built around multiple matrix-multiply units; in TPU version 4, each unit is a 128x128 systolic array capable of 16,384 multiply-accumulate operations per clock cycle.
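As a back-of-the-envelope check on those numbers (the clock rate below is my assumption for illustration, not a published spec):

# A 128x128 systolic array holds 128 * 128 = 16,384 multiply-accumulate cells.
macs_per_cycle = 128 * 128          # 16,384 MACs per clock
clock_hz = 1.0e9                    # assume ~1 GHz, purely for illustration
ops_per_mac = 2                     # count a MAC as one multiply plus one add
peak_ops = macs_per_cycle * clock_hz * ops_per_mac
print(f"{peak_ops / 1e12:.1f} TOPS per matrix unit (illustrative)")   # ~32.8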


It's called a Tensor Processing Unit because tensors are the mathematical objects from linear algebra that generalize vectors and matrices, which are the basic data structures in neural networks. In other words, a TPU is a custom hardware implementation of what people are programming GPUs to do. Google assembles multiple TPUs on a board, and then networks thousands of them together over a leading-edge end-to-end optical interconnect called OCS, which incorporates optical switches based on MEMS chips. I think it's amazing. From a technical standpoint, when I read about TPUs and OCS, I'm much more impressed by Google's technology than by anything Nvidia has. The question is, can Google's AI application technology that runs on these TPU systems be leading edge? I don't know, but they claim to be running various LLMs on these systems already.
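For readers new to the terminology, here's a tiny PyTorch sketch (mine, purely illustrative) of the kind of tensor operation everyone is ultimately accelerating: a fully-connected neural network layer is just a matrix multiply plus an elementwise nonlinearity, and that matmul is what a TPU's systolic arrays (or a GPU's ALUs) execute in hardware.

import torch

batch, d_in, d_out = 32, 512, 256
x = torch.randn(batch, d_in)        # activations: a 2-D tensor (matrix)
w = torch.randn(d_in, d_out)        # weights: another 2-D tensor
b = torch.randn(d_out)              # bias: a 1-D tensor (vector)

# One neural network layer: matrix multiply + add + nonlinearity.
y = torch.relu(x @ w + b)
print(y.shape)                      # torch.Size([32, 256])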


Amazon has announced that it has also developed its own inference ASICs, called Inferentia accelerators. As you can see from their website, they are already working with customers and have a software ecosystem.


So Google and Amazon could win the AI-in-the-cloud market by just being good enough at far lower cost (and likely lower power consumption).

Intel, while it appears to still be working on GPUs, is also extending the x86 instruction set with fast matrix operations, and seems to be working from the bottom up rather than the top down in the AI market.
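To illustrate what a hardware matrix extension actually accelerates (my own sketch, not Intel code): frameworks break a big matrix multiply into small fixed-size tiles, and a matrix unit consumes one tile-sized multiply-accumulate per instruction. The pure-Python tiling below just shows the decomposition.

import numpy as np

def tiled_matmul(a, b, tile=16):
    # Block the matmul into tile x tile chunks, roughly the unit of work a
    # hardware matrix unit would execute in one operation (illustrative).
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % tile == 0 and n % tile == 0 and k % tile == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)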


Tenstorrent appears to have similar ideas to Intel, and is extending RISC-V to better handle AI workloads.

(The Google-Intel-Tenstorrent situation is somewhat weird, because Dave Patterson, one of the leaders of the RISC-V initiative at UC Berkeley, is a distinguished engineer at Google working on TPUs, while Tenstorrent's CEO, Jim Keller, was previously an SVP at Intel.)

It might be that the CPU-based AI capabilities are just easier to develop applications for, and much cheaper to start small and grow, so this strategy shouldn't be quickly dismissed.


And then there's Cerebras, a start-up currently valued at $4B+, which uses wafer-scale integration to produce the world's largest monolithic chips. The chips have hundreds of thousands of custom AI cores on each wafer-scale die, interconnected by the world's fastest interconnection network (because it's on-die), but they're only available as fully proprietary systems. That means Cerebras is more like the old Cray full-custom supercomputers than the systems Nvidia GPUs are used in, but if we're talking pure technical capability, Cerebras is the current world champion. Why is Cerebras even mentioned in the same breath as Nvidia? Because most of Nvidia's datacenter story is now about assembling GPUs into what are really AI supercomputers.


Note the write-up on their website about the creation of a 4 exaflop AI supercomputer.

So I'm not so sure that Nvidia maintains its current far-and-away leadership position in the long run. And it doesn't look like it's AMD that will be the challenger. To me, Google looks like the most impressive challenger over the long run.
 
Thanks for the excellent reply and information. Cerebras is a company I have followed off and on and will get back to looking at carefully. Who do you feel leads in memory for AI (Micron, Samsung?), and outside of ASML, who provides the key equipment and materials, AMAT?
 
Since Cerebras is a private company, outsiders cannot invest yet, but my guess is that a very large existing public company eventually makes them an acquisition offer they can't refuse, probably a systems or cloud computing company. If you look at the "company" page on their website, their advisors are a who's-who of famous names in the industry: Sam Altman, Andy Bechtolsheim, Nick McKeown, David Perlmutter, Pradeep Sindhu, and Lip-Bu Tan among them. All they're missing is a Kardashian. Dark horses in that acquisition race might someday be IBM or Oracle.

I don't have extensive knowledge about the memory market or chip fabrication equipment. Searching for "HBM and AI" will give you a lot to read about for memory.
 
Good info, Mr. Blue. How about I*R-and-accumulate and other analog methods of matrix multiplication and accumulation? Can we Neanderthals compete?
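In case the I*R idea isn't familiar: store each weight as a conductance G, apply the inputs as voltages V, and Ohm's law (I = G*V) plus Kirchhoff's current law (currents on a shared wire add) give you a multiply-accumulate essentially for free. Here's a toy simulation of an ideal crossbar, ignoring noise and device non-idealities:

import numpy as np

def analog_crossbar_mvm(conductances, voltages):
    # Ideal resistive-crossbar matrix-vector multiply.
    # conductances[i, j] is the device at row i, column j (the stored weight);
    # voltages[i] is the input driven onto row i.  The current collected on
    # column j is sum_i G[i, j] * V[i]  (Ohm's law + Kirchhoff's current law).
    return conductances.T @ voltages

G = np.random.rand(4, 3)            # 4 inputs -> 3 outputs, weights as conductances
V = np.random.rand(4)               # input voltages
I_out = analog_crossbar_mvm(G, V)   # output currents = weighted sums
print(I_out)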
 

Thanks for posting. Most people think the only players are Nvidia ... then maybe AMD & Intel in the future.
 
Thanks, Blueone, for the primer.

In spite of Nvidia's big lead in AI to date, I still feel the field is in its infancy, with lots of opportunities for many yet, including all those you mentioned. Fortunes to be won and lost; right now it's pretty nice if you're selling the picks and shovels.

I believe AMD will do fine. Not lights-out like Nvidia, but as AMD establishes itself, Nvidia's margins will come back down to Earth. After all, this is what the customer base wants to see.

To me, AI is a very broad term, so theoretically many providers could succeed and the field could bifurcate as time goes on. The IBM work intrigues me a lot: another application of phase-change materials, this time for in-memory compute and the implementation of neurons. Intuitively, this feels like probing how the brain actually works. After all the struggles over earlier PCM work, it would be great to see something actually reach high-volume production.
 
Me too. IBM produces a lot of fascinating research results; I follow IBM Research regularly. But IBM's success in turning that great research into revenue-producing products seems disappointing. It feels like they could do better.
 
Hello all: I don't have nearly as much technical knowledge and ability as many on this board, but I've enjoyed reading your posts. So is everyone discounting the importance and impact of AMD's latest MI300X chip, which has more memory capacity and memory bandwidth than NVDA's H100 GPU, resulting in fewer GPUs needed, and hence represents a way to make cutting-edge technology more accessible and affordable than NVDA's offerings? Could this be a way for AMD to gain market share? Interested to hear folks weigh in on whether this line of reasoning has any merit. Thanking you in advance for continuing to educate me.
 
First of all, comparing an AMD product that hasn't been delivered in volume yet against one that has been shipping, like the H100, is what marketing people do, but it's not really an apples-to-apples comparison. Nvidia recently projected they're going to ship over 500,000 H100s this year alone, so it'll be more interesting to see what Nvidia has on their roadmap. Nvidia also has specialized tensor cores in their GPU design, and perhaps I missed it, but I haven't seen comparable technology described for the AMD MI300. Finally, given the significant software development differences between AMD (PyTorch) and Nvidia (CUDA), system deployers are probably buying a roadmap more than an individual product.

That said, considering the oft-mentioned shortage of high-end datacenter GPUs, I think AMD will likely sell every high-end GPU they can build for the next 2-3 years. I think Nvidia continues for 3-5 years as the overall winner, but Google and Amazon may be the emerging competition for both of them. Everyone underestimates these guys. Google's Argos video codec chip has displaced many hundreds of thousands of CPUs in Google's YouTube systems. Amazon's Graviton CPU is on the way to having the same impact in storage systems and general-purpose servers in AWS. I think what Google and Amazon have accomplished in chip design is under-appreciated, and it portends a different sort of competition for both Nvidia and AMD.
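To put rough numbers on the "fewer GPUs" argument (the capacities below are the commonly cited figures, roughly 192 GB of HBM3 on MI300X versus 80 GB on an H100 SXM; treat them as approximations, and real sizing also has to account for KV cache and activations):

import math

def gpus_needed(params_billion, bytes_per_param, gpu_mem_gb, overhead=1.2):
    # Very rough count of GPUs just to hold the weights; "overhead" is my
    # guess for activations/KV cache, not a measured number.
    weights_gb = params_billion * bytes_per_param    # 1e9 params * bytes ~ GB
    return math.ceil(weights_gb * overhead / gpu_mem_gb)

# A 70B-parameter model in fp16 (2 bytes per parameter) is ~140 GB of weights.
print(gpus_needed(70, 2, 192))   # MI300X-class, ~192 GB -> 1
print(gpus_needed(70, 2, 80))    # H100 SXM-class, ~80 GB -> 3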
 
Blueone: I appreciate the insights you provide. I have long thought AMD would be NVDA's chief rival in the GPU space, but the insights here about what Google and Amazon are doing have required me to think again and do more homework. I get the roadmap vs. individual product notion, but I do think AMD could offer significant savings to potential customers looking to access cutting-edge technology at a more affordable price. I assume you don't believe that AMD's so-called matrix cores are anything like an equivalent to NVDA's tensor cores. Thanks for your response.
 
Aha! I thought all along that AMD must have the equivalent of Nvidia's tensor cores in their datacenter GPUs, but the AMD architecture material I was able to find didn't call them out until I knew the magic words to search for - Matrix Cores. So, thanks to you, I found this:


I need some time to study this stuff.
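In the meantime, one practical way to see either vendor's matrix hardware in action is just to run a low-precision matmul through PyTorch: on Nvidia parts the libraries route fp16/bf16 GEMMs to the tensor cores, and my understanding (an assumption on my part, not something I've verified in AMD documentation) is that the ROCm build does the analogous thing with the Matrix Cores.

import torch

if torch.cuda.is_available():       # the ROCm build also reports True here
    a = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
    b = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
    c = a @ b                       # fp16 GEMM, dispatched to tensor/matrix cores
    torch.cuda.synchronize()
    print(c.shape)
else:
    print("No GPU visible; CPU execution won't exercise matrix units.")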
 
Ok. I look forward to hearing your more expert opinion on this issue.
 
Let's not get carried away. I don't consider myself an expert on this material; my knowledge is relatively high-level, by my standards. But this is a semiconductor forum, and AI/ML experts don't naturally hang around here.
 
I recently wrote a memo about this and posted it on LinkedIn:
(I'd also welcome feedback telling me where I'm wrong.)

To the question "Can AMD close the gap?", I think the answer is resoundingly no. The gap is actually a software gap and AMD won't be the ones that close it. Some journalists have been touting PyTorch as the competitor to CUDA but PyTorch was actually developed by Meta and runs on CUDA. So I think you'll have to see Meta / OpenAI / the rest of big tech create a first class machine learning library that gains traction before AMD becomes a viable competitor here.
 