What architectural modifications do you expect AI accelerator providers to make over the next ten years in order to meet the computation demand?

Hunterdolphin

Since the beginning of this machine learning epoch, the semiconductor industry has adopted a myriad of intricate architectural changes, such as incorporating mixed-precision tensor processing units into its hardware, in order to meet the machine learning industry's demand for computation.

Looking forward, what additional hardware changes do you anticipate leading application-specific integrated circuit suppliers, such as Nvidia and Google, will introduce over the next decade in order to keep pace with the machine learning industry's expanding demand for greater computing power?
 
Long strips of AP acting as an antenna that will communicate with other chiplets within the SIP, all with their own transceivers to save on I/O. Diffused unstable molecules that will allow for self powering. All with lead packages with little holes to allow oil to flow through, all at an NRE budget, including mask sets, of just under $1B.
 
This can be accomplished through larger teams creating 10,000 page specifications by more diverse employees meeting face to face (flying in on corporate jets of course) and lots of USG funding.
 
Ooh, Cliff is in a good mood. Seriously, Mr. Dolphin, the models are evolving so fast, and tend to have different detailed optimizations depending on who authors them, so who knows? The overall trends are that big, dense arrays done really efficiently are a great all-purpose nutcracker, and that getting too fancy may slow you down compared with simple brute force. And the details of feedback between layers, and in conversational AI the details of how you use the evolving context window, can make big changes in how easy it is to build a pipeline that can be shared.

At the moment everyone is off doing their own thing. The biggest advance is probably in EDA, to allow designers to generate ASICs with a shorter dev cycle. Another big area is how to use more SRAM (roughly 100x less energy per bit than DRAM) and how to optimize DRAM interface power. The computation arrays seem relatively easy.
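A rough back-of-envelope sketch of why that SRAM/DRAM gap dominates the power picture; the per-bit energies and the traffic figure below are illustrative assumptions, not measured numbers:

# Back-of-envelope: power spent just moving operands, SRAM vs DRAM.
# Per-bit energies are assumed values chosen to reflect the ~100x gap above.
SRAM_PJ_PER_BIT = 0.05   # assumed on-chip SRAM access energy, pJ/bit
DRAM_PJ_PER_BIT = 5.0    # assumed off-chip DRAM access energy, pJ/bit

def traffic_power_watts(bytes_per_second, pj_per_bit):
    """Power needed just to move data at the given rate."""
    return bytes_per_second * 8 * pj_per_bit * 1e-12

bw = 1e12  # 1 TB/s of operand traffic, an assumed workload
print(f"SRAM: {traffic_power_watts(bw, SRAM_PJ_PER_BIT):.1f} W")  # ~0.4 W
print(f"DRAM: {traffic_power_watts(bw, DRAM_PJ_PER_BIT):.1f} W")  # ~40 W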
 
I have some work to show @Tanj and others, but I am all in on cutting transistor counts by tens to hundreds of millions to deliver more data with less power and latency.

My view is that the industry will grasp the potential to cut through much of the complexity with 'intelligent' (neuromorphic) designs that reduce memory use and increase raw compute capability through efficiency.

Complexity is addictive. I think that NVIDIA and Broadcom will have the hardest time adjusting. I expect data center and AI operators to be quicker to move, and new players who see the power of simplicity will cause a dramatic industry reordering.
 
I feel waaaay out of place with my ~20-block PowerPoint design. Granted, such specifications rely heavily on those analog automated routing solutions. ;-) I could look at printing it on much thicker paper so as to qualify for the billions in funding.
 
Parallelism is effective. ISSCC had papers from several different groups with access to IP blocks for BF16 or INT16 multiply-add at under 0.1 pJ per op, which lets you build 100 TOPS chips at 10 W. These will have matrix multipliers of 128x128 and upwards, operating at low clock rates for efficiency.
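The arithmetic behind that power figure is just throughput times energy per op; a minimal sketch using the numbers quoted above:

# Power = throughput x energy per op (figures from the paragraph above).
ENERGY_PER_OP = 0.1e-12   # 0.1 pJ per BF16/INT16 multiply-add
THROUGHPUT = 100e12       # 100 TOPS
print(THROUGHPUT * ENERGY_PER_OP)   # -> 10.0 watts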

The explosion in AI is happening due to emergent behavior (a phase change is another analogy) that only appears at very large scale. This is a qualitatively different domain from recognizing cat pictures, or even face recognition in a crowd. While there have been some interesting spin-offs trying to pare back the size of the models, even the best of those are still huge, and they lose some capability.

So, while you can certainly do some nice industrial vision with small transistor counts or neuromorphic arrays, the difference in scale is like comparing a teaspoon to a bulldozer and expecting them both to level a mountain. The revolution occurring now is not yesterday's AI market, which still exists; the explosion is happening elsewhere.

Also, when the parallelism drops below the size of the model's arrays, the clock rates have to go up. If the arrays are not big enough to be expensive, then the lowest power is likely to come from more parallelism and a slower clock. Transistors are cheap.
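A quick sketch of that parallelism-versus-clock tradeoff; the throughput target and the one-multiply-add-per-cell-per-cycle model are assumptions for illustration only:

# For a fixed throughput target, required clock vs. array size.
# Assumes one multiply-add per array cell per cycle; purely illustrative.
TARGET_MACS_PER_S = 50e12   # assumed model throughput requirement (50 TMAC/s)

for n in (32, 64, 128, 256):
    macs_per_cycle = n * n              # an NxN multiply-add array
    clock_ghz = TARGET_MACS_PER_S / macs_per_cycle / 1e9
    print(f"{n}x{n} array -> {clock_ghz:.2f} GHz required")

# 32x32 needs ~48.8 GHz (impossible); 128x128 needs ~3.05 GHz; 256x256 only
# ~0.76 GHz. Wider parallelism lets the clock (and voltage) drop, which is
# where the energy savings come from.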
 
If I understand correctly, you're suggesting that forthcoming AI hardware accelerators will bring a large increase in matrix multiplication capacity, for example transitioning from the 32x32 dimensions in current hardware accelerators to a 128x128 configuration in subsequent iterations.

Basically, larger matrix multipliers and higher-capacity SRAM seem to be the most likely prediction.
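To put a rough number on the SRAM side of that prediction, here is a sketch of how much on-chip memory a single large weight matrix needs; the hidden dimension and 8-bit weights are assumptions for illustration:

# How much SRAM does one transformer-style weight matrix need to stay resident?
def weight_mib(rows, cols, bits=8):
    return rows * cols * bits / 8 / 2**20

d_model = 4096                               # assumed hidden dimension
print(weight_mib(d_model, 4 * d_model))      # one 4096x16384 int8 matrix -> 64.0 MiB

# A single large layer is already in the range of the total on-chip SRAM of many
# current accelerators, which is why bigger SRAM and better DRAM/HBM interfaces
# go hand in hand.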
 
Currently TPUv4 uses 128x128; it went into production two years ago. I believe the arrays can get denser with a few tricks in the design, but it is not a steep exponential for the longer term. Arithmetic circuits are ancient and mature.

SRAM is pretty much stalled, since it already has near-perfect wiring and fin layout. I believe it is scheduled to double in density around the CFET era due to new vertical layouts. There can also be an advantage in density and cost from manufacturing on an SRAM-optimal process, as can be seen with the doubling of density in AMD's V-cache chip relative to the cores' chip. Interfacing to HBM is also improving. With 8-bit operands, a set of 3 HBM3 stacks can saturate a 128x128 matmul in synchronous streaming.
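For a sense of scale on that last point, a sketch of the operand bandwidth a 128x128 8-bit array consumes when streaming with no on-chip reuse; the clock and the per-stack HBM3 figure (commonly quoted at around 0.8 TB/s) are assumptions, so treat this as illustrative only:

# Operand bandwidth to stream a 128x128 8-bit matmul array with no reuse.
N = 128              # array dimension
BYTES = 1            # 8-bit operands
CLOCK_HZ = 2.0e9     # assumed array clock
HBM3_STACK = 0.8e12  # assumed usable bandwidth per HBM3 stack, bytes/s

# Each cycle the array ingests one N-element slice of each operand matrix.
operand_bw = 2 * N * BYTES * CLOCK_HZ
print(f"operand stream: {operand_bw / 1e12:.2f} TB/s")                # ~0.51 TB/s
print(f"fraction of 3 stacks: {operand_bw / (3 * HBM3_STACK):.2f}")   # ~0.21, so 3 stacks have headroom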

Optimization by algorithm will be most important.
 