
Intel Arc B580

Thanks for the articles -- the first link was really interesting: training isn't exactly parallel, but you still get the benefits of multiple GPUs. The extra bandwidth of NVLink makes a lot of sense for this.

I think today the only PCIe 5.0 GPUs available are the new Blackwell professional GPUs, and the future 50 series?
You might be right:

 
He is right. Nvidia published the number with INT4 and sparsity, so to get the dense number divide by 2, and then divide by 2 again at each step up in precision:
1000/(2*2*2) = 125 TOPS FP16
Does "1000 AI TOPs" mean INT4 or FP4?

What do sparse and dense mean in this context?

I initially thought, to convert FP4 to FP16 is by dividing FP4 TOPs by 4.
 
Nvidia marketing usually quotes the lowest-precision numbers with sparsity, so the 1000 TOPS here were INT4/FP4 with sparsity. Here is a look at their Ada whitepaper: sparsity just doubles the theoretical throughput, so you have to account for that and divide by an additional 2 to back sparsity out.
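To make the unwinding concrete, the arithmetic can be sketched like this (assuming the headline figure is sparse INT4/FP4, as discussed above):

```python
# Unwind a marketed "1000 AI TOPS" headline figure into a dense FP16 rate.
# Assumption: the headline number is INT4/FP4 *with* the 2x sparsity factor.
marketed_tops = 1000

dense_int4 = marketed_tops / 2  # remove the 2x sparsity factor
dense_int8 = dense_int4 / 2     # INT4 runs at twice the INT8 rate
dense_fp16 = dense_int8 / 2     # INT8 runs at twice the FP16 rate

print(dense_fp16)  # 125.0, matching the 1000/(2*2*2) = 125 TOPS figure above
```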
 
A person commented on my latest video, pointing out that Intel’s ARC GPU does not support sparse acceleration. Here is the comment:


“5:49 Intel ARC does not accelerate sparse matrix multiplication. 1× FP16 = 1× BF16 = 2× INT8 = 4× INT4 = 4× INT2. So, 117 TFLOPS FP16 is 468 TOPS INT4. And these are only maximum theoretical values, which are unlikely to be achievable in real applications.”

He also mentioned that the performance for INT2 is the same as INT4.

I came across this paper:
PERFORMANCE OPTIMIZATION OF DEEP LEARNING SPARSE MATRIX KERNELS ON INTEL MAX SERIES GPU
https://arxiv.org/pdf/2311.00368

My questions are:
  1. How to validate what he says? Is there any documentation to check?
  2. If INT2 performance is the same as INT4, why offer support for INT2 at all?
  3. What is the underlying reason for lacking sparse acceleration? Is it due to oneAPI/software?
 
He is telling the truth. The lowest precision is INT2, but it runs at the same rate as INT4 on Intel GPUs, and yes, they don't support sparsity, so max TOPS remains the same. As for offering support, there is nothing wrong with having it even if it brings no speedup; the same way, NVDA still has FP64 support but it is lackluster.

For calculating TOPS you can do:
ops/cycle * clock frequency * number of XMX/Tensor units
Apparently the B580 has 160 XMX units, and from the chart a group of 8 XMX cores can do 4096 ops/clk, so:
20 * 4096 * 2.850/1000 (2.85 GHz is the boost clock for the B580; apparently this is what the TOPS are calculated at)
That comes out to around 233 TOPS of INT8, as stated on Intel Ark.
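As a sketch, the peak-TOPS formula above can be written out in Python. The 160 XMX units, 4096 ops/clk per group of 8, and 2.85 GHz boost clock are the figures quoted in this thread, so treat them as assumptions:

```python
def peak_tops(xmx_units, ops_per_clk_per_group, group_size, boost_ghz):
    """Theoretical peak TOPS: (ops per clock across the GPU) * (GHz) / 1000."""
    groups = xmx_units // group_size              # B580: 160 / 8 = 20 groups
    gpu_ops_per_clk = groups * ops_per_clk_per_group
    return gpu_ops_per_clk * boost_ghz / 1000     # Gops/s -> TOPS

# INT8: 20 * 4096 * 2.850 / 1000 ~= 233 TOPS, matching Intel Ark
int8_tops = peak_tops(160, 4096, 8, 2.85)
# INT4 runs at twice the INT8 rate (8192 ops/clk per group of 8)
int4_tops = peak_tops(160, 8192, 8, 2.85)
print(round(int8_tops, 2), round(int4_tops, 2))  # 233.47 466.94
```

The second call reproduces the ~467 INT4 TOPS number discussed later in the thread, since INT4 simply doubles the per-group rate.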

You should definitely edit your video to correct the data point
 
Thank you. Is sparse acceleration support attributed to software rather than silicon? Based on the research paper, it seems it is achieved via software for the data center GPUs?
 

Got you. I will edit the videos again using what you suggested.

"Most orgs in this space default to INT8, but we've asked Intel to clarify its numbers. INT4 trades more TOPS for less accuracy, but using INT4 to measure the NPU's performance wouldn't necessarily be a bad thing – it would lower memory requirements to run local AI models. But it would mean the performance comparison of Meteor Lake and Lunar Lake is not really apples to apples."

But if we are trying to compare to Nvidia's numbers, should we do the following instead as Nvidia is FP4?
20*8192*2.850/1000=466.94 AI TOPS

Maybe I should just indicate the basis for the AI TOPS figure, whether it is 4-bit or 8-bit.
 
I don't know exactly; it may be a HW thing or SW trickery.
LNL has an NPU capable of 48 INT8 TOPS, while MTL is 11.4 INT8 TOPS.
I think INT8 and BF16 are the most relevant, so you should quote both; FP4/INT4 are too low.
 
Yes. The subject is quite confusing. With BF16, I guess we should probably use TFLOPS, but I see your point, as BF16 is designed for deep learning calculations.

The problem is that Nvidia is using FP4/INT4 as TOPS.

Maybe I do this:
FP16 (TFLOPS)
FP8/INT8 (TOPS)
FP4/INT4 (TOPS)
FP4 Sparse (TOPS)
 
OT - but if you don't care about speed, there's a 16GB Raspberry Pi 5 now. Should be able to run 12-14B models under Linux.
 
Yeah, I missed it. It's just confusing with everyone boasting TOPS.
 
Yes... most people do not know exactly what it means. Very confusing. Going through the calculation again: since the B580 has some overclocking headroom, the theoretical INT4 TOPS could possibly reach 500 TOPS.
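For what it's worth, the boost clock needed to hit that 500 INT4 TOPS figure can be back-calculated from the same formula used earlier in the thread (20 groups of 8 XMX units at 8192 INT4 ops/clk per group; these figures are assumptions from this discussion):

```python
# Back-calculate the clock needed for 500 theoretical INT4 TOPS on the B580.
groups = 160 // 8            # 20 groups of 8 XMX units
ops_per_clk = groups * 8192  # INT4 ops per clock across the GPU

target_tops = 500
required_ghz = target_tops * 1000 / ops_per_clk
print(round(required_ghz, 3))  # 3.052 -> a modest overclock over the 2.85 GHz boost
```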
 
Based on the following discussion, the performance advantage of sparse acceleration is not really obvious.

 