
Intel Arc B580

Thanks for the articles -- the first link was really interesting: training isn't exactly parallel, but you still get the benefits of multiple GPUs. The extra bandwidth of NVLink makes a lot of sense for this.

I think today the only PCIe 5.0 GPUs available are the new Blackwell professional GPUs, and the future 50 series?
You might be right:

 
He is right. Nvidia published the number with INT4 and sparsity, so to get the dense number divide by 2, and then divide by 2 again at each step up in precision:
1000/(2*2*2) = 125 TOPS FP16
Does "1000 AI TOPs" mean INT4 or FP4?

What do sparse and dense mean in this context?

I initially thought, to convert FP4 to FP16 is by dividing FP4 TOPs by 4.
 
Nvidia marketing usually quotes the lowest-precision numbers with sparsity, so the 1000 TOPS here were INT4/FP4 with sparsity. Here is a look at their Ada whitepaper: sparsity just doubles the theoretical throughput, so you have to account for that and divide by an additional 2 to back sparsity out.
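To make the unwinding concrete, the arithmetic can be sketched like this (assuming the headline figure is sparse INT4/FP4, as discussed above):

```python
# Unwind a marketed "1000 AI TOPS" headline figure into a dense FP16 rate.
# Assumption: the headline number is INT4/FP4 *with* the 2x sparsity factor.
marketed_tops = 1000

dense_int4 = marketed_tops / 2  # remove the 2x sparsity factor
dense_int8 = dense_int4 / 2     # INT4 runs at twice the INT8 rate
dense_fp16 = dense_int8 / 2     # INT8 runs at twice the FP16 rate

print(dense_fp16)  # 125.0, matching the 1000/(2*2*2) = 125 TOPS figure above
```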
 
A person commented on my latest video, pointing out that Intel’s ARC GPU does not support sparse acceleration. Here is the comment:


“5:49 Intel ARC does not accelerate sparse matrix multiplication. 1× FP16 = 1× BF16 = 2× INT8 = 4× INT4 = 4× INT2. So, 117 TFLOPS FP16 is 468 TOPS INT4. And these are only maximum theoretical values, which are unlikely to be achievable in real applications.”

He also mentioned that the performance for INT2 is the same as INT4.

I came across this paper:
PERFORMANCE OPTIMIZATION OF DEEP LEARNING SPARSE MATRIX KERNELS ON INTEL MAX SERIES GPU
https://arxiv.org/pdf/2311.00368

My questions are:
  1. How to validate what he says? Is there any documentation to check?
  2. If INT2 performance is the same as INT4, why offer support for INT2 at all?
  3. What is the underlying reason for lacking sparse acceleration? Is it due to oneAPI/software?
 
He is telling the truth. The lowest precision is INT2, but it runs at the same rate as INT4 on Intel GPUs, and yes, they don't support sparsity, so max TOPS remains the same. As for offering support, there is nothing wrong with having it even if it brings no speedup; the same way, NVDA still has FP64 support but it is lackluster.

For calculating TOPS you can do:
ops/cycle * clock frequency * number of XMX/Tensor units
Apparently the B580 has 160 XMX units, and from the chart a group of 8 XMX cores can do 4096 ops/clk, so:
20 * 4096 * 2.850/1000 (2.85 GHz is the boost clock for the B580; apparently this is what the TOPS are calculated at)
That comes out to around 233 TOPS of INT8, as stated on Intel Ark.
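As a sketch, the peak-TOPS formula above can be written out in Python. The 160 XMX units, 4096 ops/clk per group of 8, and 2.85 GHz boost clock are the figures quoted in this thread, so treat them as assumptions:

```python
def peak_tops(xmx_units, ops_per_clk_per_group, group_size, boost_ghz):
    """Theoretical peak TOPS: (ops per clock across the GPU) * (GHz) / 1000."""
    groups = xmx_units // group_size              # B580: 160 / 8 = 20 groups
    gpu_ops_per_clk = groups * ops_per_clk_per_group
    return gpu_ops_per_clk * boost_ghz / 1000     # Gops/s -> TOPS

# INT8: 20 * 4096 * 2.850 / 1000 ~= 233 TOPS, matching Intel Ark
int8_tops = peak_tops(160, 4096, 8, 2.85)
# INT4 runs at twice the INT8 rate (8192 ops/clk per group of 8)
int4_tops = peak_tops(160, 8192, 8, 2.85)
print(round(int8_tops, 2), round(int4_tops, 2))  # 233.47 466.94
```

The second call reproduces the ~467 INT4 TOPS number discussed later in the thread, since INT4 simply doubles the per-group rate.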

You should definitely edit your video to correct the data point
 
Thank you. Is sparse acceleration support attributed to software rather than silicon? Based on the research paper, it seems it is achieved via software for the data center GPUs?
 

Got you. I will edit the videos again using what you suggested.

"Most orgs in this space default to INT8, but we've asked Intel to clarify its numbers. INT4 trades more TOPS for less accuracy, but using INT4 to measure the NPU's performance wouldn't necessarily be a bad thing – it would lower memory requirements to run local AI models. But it would mean the performance comparison of Meteor Lake and Lunar Lake is not really apples to apples."

But if we are trying to compare to Nvidia's numbers, should we do the following instead as Nvidia is FP4?
20*8192*2.850/1000=466.94 AI TOPS

Maybe I should just indicate the basis for the AI TOPS figure, whether it is 4-bit or 8-bit.
 
I don't know exactly; it may be a HW thing or SW trickery.
LNL has an NPU capable of 48 INT8 TOPS, while MTL is 11.4 INT8 TOPS.
I think INT8 and BF16 are the most relevant, so you should quote both; FP4/INT4 are too low.
 
Yes. The subject is quite confusing. With BF16, I guess we should probably use TFLOPS, but I see your point, as BF16 is designed for deep learning calculations.

The problem is that Nvidia is using FP4/INT4 as TOPS.

Maybe I do this:
FP16 (TFLOPS)
FP8/INT8 (TOPS)
FP4/INT4 (TOPS)
FP4 Sparse (TOPS)
 
OT - but if you don't care about speed, there's a 16GB Raspberry Pi 5 now. Should be able to run 12-14B models under Linux.
 
Yeah, I missed it. It's just confusing with everyone boasting TOPS.
 
Yes... most people do not know exactly what it means. Very confusing. Going through the calculation again: since the B580 has some overclocking headroom, the theoretical INT4 TOPS could possibly reach 500 TOPS.
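For what it's worth, the boost clock needed to hit that 500 INT4 TOPS figure can be back-calculated from the same formula used earlier in the thread (20 groups of 8 XMX units at 8192 INT4 ops/clk per group; these figures are assumptions from this discussion):

```python
# Back-calculate the clock needed for 500 theoretical INT4 TOPS on the B580.
groups = 160 // 8            # 20 groups of 8 XMX units
ops_per_clk = groups * 8192  # INT4 ops per clock across the GPU

target_tops = 500
required_ghz = target_tops * 1000 / ops_per_clk
print(round(required_ghz, 3))  # 3.052 -> a modest overclock over the 2.85 GHz boost
```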
 
Based on the following discussion, the performance advantage of sparse acceleration is not really obvious.

 