
Christmas Season CPU Best Sellers Rank on Amazon: AMD vs. Intel

View attachment 2607
Mine as well. BTW, you can't install XPU and CUDA at the same time :(
At the moment, PyTorch/XPU does not support the B580 on Linux, so I tested the nightly release on Windows. But I experienced the following error when executing my own code:
View attachment 2608

Someone reported this error in July:

I am not sure which GitHub repository to report such an error to.
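
For what it's worth, a minimal smoke test of the XPU backend looks something like this (a sketch only, assuming a PyTorch build with XPU support - not the exact code that failed):

    import torch

    # Minimal check that the XPU backend sees the card (sketch only,
    # assumes a PyTorch nightly build with XPU support).
    print(torch.__version__)
    print("XPU available:", torch.xpu.is_available())

    if torch.xpu.is_available():
        x = torch.randn(1024, 1024, device="xpu")
        y = x @ x  # basic matmul on the Intel GPU
        print(y.mean().cpu())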
 
At the moment, PyTorch/XPU does not support the B580 on Linux, so I tested the nightly release on Windows. But I experienced the following error when executing my own code:
View attachment 2608
Someone reported this error in July:

I am not sure which GitHub repository to report such an error to.
Damn, maybe it will take a year before it becomes good.
 
FYI - A table has "leaked" showing the performance improvement (top-end clocks *and* perf/watt) from moving Redwood Cove from Intel 4 to Intel 3. (Both Meteor Lake and Arrow Lake-U use Redwood Cove P-Cores and Crestmont E-Cores.)

The table shows P-Core differences at the 12, 15, and 28W levels, and E-Core differences at the 15W level. A few takeaways (a quick arithmetic check follows the list):

- P-Core turbo clock increases about 8-11% (4.8-5.3 GHz vs 4.3-4.9 GHz)
- E-Core turbo clock increases about 10% (4.2 GHz vs 3.8 GHz)
- Efficiency gains at all clock speeds are significant:
1. The 5-8 core "E core" clock increases from 3.4 GHz max to 4.1 GHz
2. At 12W, P-Cores base clock increases by over 20% (1.4 to 1.8 GHz)
3. At 15W, E-Core base clock increases by over 40% (1.2 GHz to 1.7 GHz), and P-Core base clock increases by about 20%
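
If you want to verify those percentages against the quoted clocks, the arithmetic is simple (figures taken from the list above, nothing independent):

    # Sanity-check the uplift percentages quoted from the leaked table.
    def uplift(old_ghz, new_ghz):
        return (new_ghz / old_ghz - 1) * 100

    print(f"P-Core turbo, low bin:  {uplift(4.3, 4.8):.1f}%")  # ~11.6%
    print(f"P-Core turbo, high bin: {uplift(4.9, 5.3):.1f}%")  # ~8.2%  -> '8-11%'
    print(f"E-Core turbo:           {uplift(3.8, 4.2):.1f}%")  # ~10.5% -> 'about 10%'
    print(f"P-Core base @ 12W:      {uplift(1.4, 1.8):.1f}%")  # ~28.6% -> 'over 20%'
    print(f"E-Core base @ 15W:      {uplift(1.2, 1.7):.1f}%")  # ~41.7% -> 'over 40%'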


(attachment: the leaked Intel 4 vs Intel 3 Redwood Cove clock/efficiency table)


Source: https://x.com/jaykihn0/status/1858299099608752597?s=46
 
At the moment, PyTorch/XPU does not support the B580 on Linux, so I tested the nightly release on Windows. But I experienced the following error when executing my own code:
View attachment 2608
Someone reported this error in July:

I am not sure which GitHub repository to report such an error to.
I reported it on the PyTorch GitHub page. It should be fine since Falcon Shores is expected to launch at the end of 2025. This card is primarily targeted at gaming, but I’m sure some people will use it for PyTorch, which is a good thing for software readiness.

 
My two cents here: the flagship DC products are just like any flagship products - for show, and low volume. The arguments about Diamond Rapids vs. whatever massive 384-core single-socket AMD CPU are a bit overblown, I feel. It doesn't really matter, as the memory price for those servers is going to be insane: even with 12/16 channels of memory, each stick would have to be at least 128 GB to provide enough memory to keep those cores fed. The actual cost is absolutely insane, so I doubt most server owners would go for it.

SRF and GNR only needed to be competitive in the lower core counts to keep market share. The fact that Intel's little cores are able to match AMD's IPC is a key differentiator in the types of workloads they are going after. GNR will have an impact on margins, because as we go into 2025 platform adoption will stop Intel's bleeding and stabilize the DC market - margins might take a hit but will at least stabilize, letting us see whether Intel will really go bankrupt or has enough money to keep going for several more years.
First, for many HPC applications the hardware cost is insignificant compared to the annual licensing (for instance, Oracle and its per-CPU, per-year charge).

Higher core count HPC servers are very desirable for scalability, and these applications thrive on lots of RAM.

Second, Intel's little cores are not able to match AMD's performance per core (and are currently not able to match Zen 5c IPC either, only Zen 4 IPC ... and then only in SPECint, AFAIK). In an MT application, Zen 5 gains around 40% average performance through SMT, placing it safely at about 1.5 times the performance of a single Skymont core today.

... and back to that software licensing argument .... simply packing more, less performant, smaller cores onto a die will only work in hyperscaler/cloud applications, where per-core licensing is unusual. In HPC this would be a non-starter, IMO.
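
To put rough numbers on the licensing point (all prices and scores below are made up for illustration; only the ~1.5x Zen 5 vs. Skymont ratio comes from the post above):

    # Hypothetical per-core-licensed deployment: fewer, faster cores win.
    LICENSE_PER_CORE_PER_YEAR = 10_000  # made-up Oracle-style fee

    zen5_perf_per_core = 1.5     # relative MT perf with SMT (from above)
    skymont_perf_per_core = 1.0  # baseline

    target_perf = 300            # arbitrary total performance target

    for name, perf in [("Zen 5", zen5_perf_per_core),
                       ("Skymont", skymont_perf_per_core)]:
        cores = target_perf / perf
        print(f"{name}: {cores:.0f} licensed cores -> "
              f"${cores * LICENSE_PER_CORE_PER_YEAR:,.0f}/year")
    # The smaller cores need ~1.5x the licensed core count for the same work.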
 
I don't think there is anything wrong with that margin calculation.

I believe Intel Products group margin is calculated using market wafer pricing for a similar process node - as if the Intel products team were buying wafers from Intel Foundry at TSMC prices. (I think that is what they are supposed to be doing ideally, but in reality there may be some bias or number-massaging going on.)
This way, Intel Foundry being financially unsound as a business (high cost, inefficiency, low utilization, idle fabs, huge depreciation, etc.) is not the Intel products team's concern, and it gives Intel management an opportunity to identify inefficiencies and fix them.

Now I fully acknowledge that Intel Foundry was and is a strong moat for Intel, one that made them the king of compute for the last couple of decades. But that does not mean it was run efficiently or profitably (in an IDM structure, it was not even a concern). Even in 2021, when Intel was making record revenue of $79B and record profits, Intel Foundry had an operating loss of $5B! Intel's operating margin at that time was 25% for the entire business, so Intel products must have had a much higher margin to offset this foundry loss.
https://www.intc.com/filings-report...0000050863-24-000068/0000050863-24-000068.pdf
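
Working through that 2021 arithmetic (figures from the paragraph above; assuming products booked roughly the full $79B of revenue, which is my simplification):

    # Back-of-envelope check on the 2021 numbers cited above.
    revenue = 79.0      # $B total Intel revenue
    op_margin = 0.25    # 25% company-wide operating margin
    foundry_loss = 5.0  # $B foundry operating loss

    total_op_income = revenue * op_margin                 # ~$19.8B
    products_op_income = total_op_income + foundry_loss   # ~$24.8B

    # Products' implied operating margin, well above the blended 25%:
    print(f"{products_op_income / revenue:.1%}")  # ~31.3%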

So I don't believe the Intel Foundry division was ever profitable on its own, and being an IDM, nobody at Intel cared. Pat and Dave spoke many times about too many steppings, hot lots, etc.

The real problem I have with Intel's margin calculation is another operating expense bucket called "Unallocated Corporate Expenses," which includes stock-based compensation but is not included in the segment operating profit calculation. I believe that is supposed to be included in COGS, R&D, and SG&A, depending on who receives the stock. So I do think Intel product margins are overstated a little because of this. I am not a finance person, so I can't truly say whether doing it this way is acceptable per GAAP, but these results were audited and approved.


I think (IMHO) Lion Cove is actually not that bad. There is definitely a latency issue due to the tile approach vs. monolithic, but Robert Hallock (Intel) said in a recent interview that firmware/microcode issues are overblowing this problem. Recent patches/updates by game developers have also improved ARL-S gaming performance; for example, Cyberpunk 2077 had a huge uplift that puts the 285K in line with or slightly above the 14900K while consuming less power. So with time, some of these issues will be ironed out. I don't expect it to beat the 9940X or 14900K across the benchmarks anytime soon.

Also, with regard to process nodes: Zen 5 is on N4P (if I am not mistaken), so N3 (N3B now) is 3-8% better in power and 0-4% better in performance, per the following graphic. And I am pretty sure N3 is very costly, too.
Considering this is one of the first times Intel engineers are working with an external foundry to make their products, and their first tile approach on desktop - sort of like Zen 2 for AMD (and from what I saw, Zen 2 was not well received for gaming either) - I think we can cut them some slack this one time.

View attachment 2605
While it is, of course, acceptable practice to have your business units roll up however you decide in the company, it is generally used as a tool to accomplish a political goal (I have done it myself). In this case, someone decided to make a case for breaking off the foundry as a separate business unit. The first step is to start reporting that part of the company as a top-line item in the rollup.

Intel had the tail wagging the dog with respect to design and foundry, IMO. Foundry should have set the design constraints, the time to new processes, and the cost per wafer, and Design should have been required to make a product that could be profitable under those constraints. When you do it the other way around, foundry is forced into costly, risky, inefficient business models.

In short, I think it was logical and easy for Intel to blame all their financial heartache on the foundry instead of the golden child ;).

I am of the opinion that Intel has only done well over the past several decades BECAUSE of the foundry, and that their designs have been ..... mostly just adequate (with some notable exceptions which were brilliant) .... and sometimes ..... strategically stupid (the aforementioned NetBurst and Itanic).

With regard to your N3B vs N4P analysis, I agree. You left out that N3B has ~15% higher transistor density than N4P. Along with the other metrics, I think it is safe to say that one could make a more performant processor on N3B than on N4P ..... all other things being equal. At least this is how I have had it figured.

Fair point on Zen 2 and AMD's first foray into chiplets. They also had latency and performance issues that weren't really addressed until Zen 3. My biggest issue with Intel is that they waited so long to start doing tiles. I also think they should have anticipated the issues from AMD's history and done a little better job with their first tiled processor as a result.
 
First, for many HPC applications the hardware cost is insignificant compared to the annual licensing (for instance, Oracle and its per-CPU, per-year charge).

Higher core count HPC servers are very desirable for scalability, and these applications thrive on lots of RAM.

Second, Intel's little cores are not able to match AMD's performance per core (and are currently not able to match Zen 5c IPC either, only Zen 4 IPC ... and then only in SPECint, AFAIK). In an MT application, Zen 5 gains around 40% average performance through SMT, placing it safely at about 1.5 times the performance of a single Skymont core today.
If you turn off SMT, the 40% disappears (and many people do turn it off, depending on their priorities), and Darkmont is not even released yet. It's also not IPC - it's more like throughput, IMO. And for that 40% throughput you have 2x the threads, vs. 1.5x the performance from one physical core and thread 🙂. For big clouds like Amazon/Google/Microsoft, almost all their stuff is in-house, so licensing goes down the drain as well. But you can say they can park 33% more VMs on Turin Dense.
... and back to that software licensing argument .... simply packing more, less performant, smaller cores onto a die will only work in hyperscaler/cloud applications, where per-core licensing is unusual. In HPC this would be a non-starter, IMO.
Also, this chip is not for HPC - we have GNR/Turin for HPC.
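
For what it's worth, the 33% figure lines up with thread counts if you assume the commonly reported top SKUs (192-core/384-thread Turin Dense vs. 288 single-threaded E-cores on Clearwater Forest - my assumption, not stated above) and one vCPU sold per hardware thread:

    # Where '33% more VMs' plausibly comes from (SKU counts are my assumption).
    turin_dense_threads = 192 * 2     # 192 Zen 5c cores with SMT2
    clearwater_forest_threads = 288   # 288 Darkmont E-cores, no SMT

    extra = turin_dense_threads / clearwater_forest_threads - 1
    print(f"Extra vCPUs per socket on Turin Dense: {extra:.0%}")  # ~33%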
 
If you turn off SMT, the 40% disappears (and many people do turn it off, depending on their priorities), and Darkmont is not even released yet. It's also not IPC - it's more like throughput, IMO. And for that 40% throughput you have 2x the threads, vs. 1.5x the performance from one physical core and thread 🙂. For big clouds like Amazon/Google/Microsoft, almost all their stuff is in-house, so licensing goes down the drain as well. But you can say they can park 33% more VMs on Turin Dense.

Also, this chip is not for HPC - we have GNR/Turin for HPC.
[chuckles] OK then ... 33% still sounds like a resounding win for Turin D to me ;). Interesting point on licensing though. I wonder how much of the market these super-customers represent?

For HPC, Intel intends Diamond Rapids in 2026 using Panther Cove X (updated Lion Cove). I see plenty about added AVX10 and APX (seems like Intel just wants to forget AVX512 ever happened), but nothing about SMT.

For the 10-15% cost in transistors, AMD shows about a 40% uplift in highly threaded workloads. I can't imagine a better trade than that, so I am confused by Intel's dropping this feature ..... especially if they hope to compete in the highly profitable (and growing) DC market moving forward.
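
Quantifying that trade with the figures above (10-15% transistor cost, ~40% MT uplift):

    # MT throughput per transistor, with vs. without SMT (figures from above).
    for area_cost in (0.10, 0.15):
        gain = 1.40 / (1 + area_cost) - 1
        print(f"{area_cost:.0%} area cost -> {gain:+.0%} throughput per transistor")
    # Roughly +22% to +27% MT throughput per transistor spent on SMT.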
 
[chuckles] OK then ... 33% still sounds like a resounding win for Turin D to me ;). Interesting point on licensing though. I wonder how much of the market these super-customers represent?
AWS, Azure, and Google Cloud - that's basically 80-90% of the entire cloud market.
For HPC, Intel intends Diamond Rapids in 2026 using Panther Cove X (updated Lion Cove). I see plenty about added AVX10 and APX (seems like Intel just wants to forget AVX512 ever happened), but nothing about SMT.
It does have SMT. AVX10/512 is a superset of AVX-512, and it removes fragmentation.
View attachment 2612


For the 10-15% cost in transistors, AMD shows about a 40% uplift in highly threaded workloads. I can't imagine a better trade than that, so I am confused by Intel's dropping this feature ..... especially if they hope to compete in the highly profitable (and growing) DC market moving forward.
Your math holds for HPC workloads; for general integer workloads, Turin D is not going to be much better than Clearwater Forest.
 

During my review, I noticed a few areas where improvements could enhance the experience for developers and AI enthusiasts:

1. Fix PyTorch Compatibility Issues: Intel and PyTorch should address software bugs to ensure that common PyTorch code runs smoothly on the B580 GPU. They could leverage coding exercises from deep learning coursework, such as those from Stanford University. Here's the link to the error I experienced: https://github.com/pytorch/pytorch/issues/143914
2. Provide Precompiled Libraries: Intel should offer compiled versions of torchvision and torchaudio for Windows, optimized for Intel GPUs, to simplify the setup process for non-Linux users (a sketch of the current install flow follows this list).
3. Support for Ollama and Similar APIs: It's crucial to ensure tools like Ollama can seamlessly utilize Intel GPUs out of the box. This would empower developers to build applications on top of LLMs using Ollama's API.
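
For context, the Windows setup looks roughly like this (the nightly wheel index URL is my understanding of what PyTorch currently documents for Intel GPUs - treat it as an assumption; note there are no matching torchvision/torchaudio wheels, which is the gap in point 2):

    # Install the PyTorch nightly with XPU support (assumed index URL):
    #   pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/xpu
    import torch

    if torch.xpu.is_available():
        print("Running on", torch.xpu.get_device_name(0))
        x = torch.randn(8, 3, 224, 224, device="xpu")  # e.g. an image batch
    else:
        print("XPU backend not available; falling back to CPU")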
 
AWS, Azure, and Google Cloud - that's basically 80-90% of the entire cloud market.

It does have SMT. AVX10/512 is a superset of AVX-512, and it removes fragmentation.
View attachment 2612


Your math holds for HPC workloads; for general integer workloads, Turin D is not going to be much better than Clearwater Forest.
Thanks. I can't quite see how the graphic shows Clearwater Forest or other Intel parts getting SMT. Could you clarify?

Not sure about CWF general workloads vs Turin D. I think it will depend on core count and bandwidth.
 
Thanks. I can't quite see how the graphic shows Clearwater Forest or other Intel parts getting SMT. Could you clarify?
This only shows the AVX-512 fragmentation that AVX10 solves. Clearwater Forest is based on Darkmont, an E-core, so it doesn't have HT, but the P-cores do. Intel's own presentation during the Lion Cove launch showed they keep HT as a tool. DMR has HT - you'll have to take my word for it.
Not sure about CWF general workloads vs Turin D. I think it will depend on core count and bandwidth.
Yeah, there are many workloads, like DBs and others, that prefer integer performance, and in those scenarios Skymont is a lot closer to Turin. As for workloads, this chart from Intel comes in handy:
(attachment: Intel's Xeon 6 P-core vs E-core target-workloads chart)
 