
Intel Arc B580

Yes... most people do not know exactly what it means. Very confusing. Going through the calculation again: since the B580 has some overclocking headroom, it is possible that the theoretical INT4 throughput can reach 500 TOPS.
Hmm. Let's think about that for a sec.

From what I've read, the RTX 4090 doesn't really benefit from more than 300W when processing AI models, because there isn't enough memory bandwidth on the card to feed the CUDA cores any faster when running LLMs. The 4090 has (per Google AI) ~1320 INT4 TOPS, but is effectively reduced to ~900 because of bandwidth limitations (300W/450W * 1320 INT4 TOPS).

If this ratio holds true for the B580, then the B580 has enough bandwidth (~450 GB/s, about 45% of the 4090's ~1008 GB/s) for roughly 400-450 INT4 TOPS. Overclocking the B580's VRAM would help, of course.
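Using the post's round numbers (4090 effective ~900 INT4 TOPS when bandwidth-limited, B580 ~450 GB/s, 4090 ~1008 GB/s), the back-of-envelope scaling works out like this:

```python
# Back-of-envelope estimate: scale the 4090's bandwidth-limited INT4
# throughput by the ratio of memory bandwidths. Figures are approximate.
RTX4090_BW_GBS = 1008          # RTX 4090 memory bandwidth
B580_BW_GBS = 450              # B580 memory bandwidth (as quoted above)
RTX4090_EFF_INT4_TOPS = 900    # bandwidth-limited 4090 estimate from above

bw_ratio = B580_BW_GBS / RTX4090_BW_GBS            # ~0.45
b580_est_tops = bw_ratio * RTX4090_EFF_INT4_TOPS   # ~400

print(f"bandwidth ratio: {bw_ratio:.2f}")
print(f"B580 bandwidth-limited INT4 estimate: ~{b580_est_tops:.0f} TOPS")
```

That lands at the low end of the 400-450 range; VRAM overclocking would push the estimate up proportionally.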

I think 2 x 24GB B580s would be pretty good value for LLM usage.
 
I redid the analysis. Thanks siliconbruh999 for pointing to the right information.

A viewer commented on my video with more PyTorch testing:
(attached screenshot of the PyTorch test results)
 
Made a video to share my experience running Deepseek R1 on B580:
Good video on the topic. Nice to see it just works fairly easily on the Intel card.

I think the cost model discussion is a good starter on the topic. Of course more privacy locally, but you also have to have a PC you can dedicate to the task on top of the card.

I really liked the comparison of token output speed vs. human reading speed (oral and silent reading) - that gave great perspective.

I am curious if you run that same model on your i5-13500 (CPU only) - what token rate do you get?
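A rough sketch of that token-rate vs. reading-speed comparison (the words-per-token and words-per-minute figures below are generic assumptions, not numbers from the video):

```python
# Convert LLM output speed (tokens/sec) to words/minute and compare it to
# typical human reading speeds. All constants are rough assumptions:
# ~0.75 words per token, ~150 wpm read aloud, ~240 wpm silent reading.
WORDS_PER_TOKEN = 0.75
ORAL_WPM = 150
SILENT_WPM = 240

def tokens_per_sec_to_wpm(tps):
    return tps * WORDS_PER_TOKEN * 60

for tps in (5, 15, 30):
    wpm = tokens_per_sec_to_wpm(tps)
    print(f"{tps:>2} tok/s ≈ {wpm:.0f} wpm "
          f"(oral {wpm / ORAL_WPM:.1f}x, silent {wpm / SILENT_WPM:.1f}x)")
```

Anything much above ~5 tok/s already outpaces reading aloud, which is why even modest local token rates feel usable.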
 
It’s my wife’s PC. To do that, I would need to remove the specific version of Ollama and install a standard version that currently uses the CPU. I might do it later. I’m waiting for a new power supply for my workstation, which will make it easier to handle these tasks.
 
Do you know offhand how to force Ollama to use the CPU? I tried a while back and kept failing; I couldn't find any instructions that actually worked. I can benchmark Deepseek on my 9800X3D for a data point (and also an i5-12600KF + DDR4, if interested).
 
Are you using an Intel GPU or an Nvidia GPU? For an Intel GPU, if you install the standard build of Ollama, I think it will run on the CPU by default.

You can see what happens via task manager.

You could try the following suggestion:

In Ollama, run "/set parameter num_gpu 0"
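The same thing can be scripted against Ollama's REST API: the "num_gpu" option (0 = offload no layers to the GPU) forces CPU-only inference, and responses report eval_count and eval_duration, from which tokens/sec follows. This is only a sketch; the model name below is an example, not a recommendation:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def tokens_per_second(eval_count, eval_duration_ns):
    # Ollama reports eval_count (generated tokens) and eval_duration (ns).
    return eval_count / (eval_duration_ns / 1e9)

def benchmark_cpu(model="deepseek-r1:8b", prompt="Why is the sky blue?"):
    # "num_gpu": 0 asks Ollama to offload zero layers to the GPU,
    # i.e. run the model entirely on the CPU.
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": 0},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])
```

Watching GPU and CPU utilization in Task Manager while this runs confirms whether the offload setting took effect.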

 
Did you really mean it? Or was that a bit of sarcasm? :)
I really hope Intel can turn itself around.
Definitely mean it.

I hope they can. They have been the biggest risk takers in moving manufacturing and state-of-the-art tech. TSMC is very risk-averse: they first observe, then jump, unlike Intel, which jumps, then observes.

I don't want the whole of Intel to suffer because of a few dumb and greedy people.
 
Interesting to see the B580 inference videos, many thanks for them. I am keenly waiting for the 24GB B580 to try/potentially replace my Tesla P40.

I was wondering something: the amount of local RAM is clearly a constraint for inference. The RAM chips themselves are not too expensive, but there are no spare footprints on the boards or exposed address lines. Old computers used to get around a lack of address lines by using a register for some extra address bits, i.e. bank switching. Could a bank-switched expansion board be made cheaply for the B580, or another similar GPU, with a few registered bits and footprints for N RAM chips, allowing for, say, 256GB of local RAM fairly cheaply? Perhaps clock speed will be too much of a constraint to keep the traces short enough?
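For a sense of scale, a bank-switching scheme only needs a handful of registered bits. Assuming an illustrative 16 GB natively addressed window (an assumption for the arithmetic, not the B580's actual address layout), reaching 256 GB takes 4 extra bank bits:

```python
import math

TOTAL_BYTES = 256 * 2**30    # target: 256 GB of local RAM
WINDOW_BYTES = 16 * 2**30    # illustrative natively-addressed window

total_bits = int(math.log2(TOTAL_BYTES))    # address bits for 256 GB: 38
window_bits = int(math.log2(WINDOW_BYTES))  # address bits for 16 GB: 34
bank_bits = total_bits - window_bits        # extra registered bits needed: 4
banks = 2**bank_bits                        # 16 banks of 16 GB each

print(f"{bank_bits} bank bits -> {banks} banks of {WINDOW_BYTES // 2**30} GB")
```

So the register itself is trivial; as the replies note, the real obstacle is signal integrity on the lengthened traces, not address bits.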
 
Clock speed is the main problem with this approach on modern computers. The traces / wires will be too long to maintain signal integrity. Unfortunately.
 