I haven't followed this in any detail and at a first reading found the Intel press release quite reassuring.
After reading @nghanaywem (above) and re-reading the Intel PR, I'm not so sure:
Short answer: We can confirm there was a via Oxidation manufacturing issue (addressed back in 2023) but it is not related to the instability issue.
Long answer: We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue.
Aren't those statements contradictory?
Technically yes. My reading between the lines has me less concerned though. To me it reads:
-The root cause is CPUs requesting more voltage from the socket than they should
-Hey this fab excursion did happen
-The oxidation is some secondary failure that was dependent on the voltage issue to even be a problem, so if we are being technical some small % of the failures are partially attributable to this
-Oxidation happened to such a small % of the total CPUs that the problem most of you face is highly unlikely to be anything besides the voltage issue
-We are disclosing the oxidation issue because if we said not a single CPU was impacted we would be lying
I could be wrong, but that is why I think the short and long answers don't perfectly match up. For all intents and purposes the short answer is true, but not technically so, and Intel needs to disclose the full truth so nobody can ever say they lied or hid anything.
As for my theory that the oxidation is just a contributing factor rather than the problem itself, my logic rests on the following three points:
1) It seems unlikely that large numbers of CPUs could get past die sort or even burn-in without Intel noticing, and if Intel knew and released them anyway, then it follows that they must have thought it wouldn't realistically be an issue in the intended environment over a 10 year lifetime.
2) Even if the CPU is only at 65 W and clocked at 1 GHz, you will fry it if you pump, say, 5 V into the core logic. I don't know exactly what the mechanism behind high voltage killing chips is, but my assumption is dielectric breakdown in the ILD. Since interconnects are basically capacitors, if the charge in the wires gets too high you can see electricity arc from one to the other and ruin the insulating properties of the dielectric in between.
3) If the barrier layer is scuffed up on some of the CPUs, then those CPUs should have an easier time arcing, potentially at lower voltages. If it was really bad, maybe some of the Cu migrated into the ILD, which could make arcing occur at even lower voltages.
If the above is what is happening in the field, then I get why Intel would say that oxidation isn't the root cause, but that some percentage of instability is related to it even if it wasn't technically the instigating event.
The first claims they are separate issues. The second states that there is a connection between them.
I'm also uneasy about the claimed fix. If it's oxidation, then manufacturing improvements seem credible. But how is the QA ("screens") change then relevant?
I get the impression - again from a very brief reading - that there are/were at least three technical issues at play here.
That was more so me commenting on the IF side of the equation. If the issue was caught in sort, that is bad for IF because foundry customers will likely be using their own OSATs. If the issue was caught inline, then the screens are fine, but Intel needs to make sure that suspect external material will be thrown out before it ever reaches customer hands. Think about it like food in your fridge. If I see something a little bit expired, I might smell it and check whether there are any growths, and if it looks and smells fine I might eat it to avoid wasting perfectly good food. If a restaurant did the same thing, folks would be outraged: I am paying you and you fed me expired food. When you pay for something you don't want something that is maybe good or partially good, you want guaranteed 100% good. Redoing how IF does quality is only needed if the oxidized material would not have been sold had Intel known about it and a large number of dies somehow slipped past quality checks without anyone noticing, as in that instance the current systems are insufficient.
@nghanaywem is correct. All tech companies have issues like this from time to time. If you don't you're not innovating enough. It's how you deal with them that sorts out the good from the bad. You either increase customer trust or erode it.
While I do think that is true, I was more so talking from the perspective of manufacturing. Sometimes a tool just breaks and starts producing output that is not within control, maybe an engineer wasn't paying attention to their control charts, operator error causes something nonstandard, etc. A good factory will have business processes and systems in place to minimize the occurrence of these events and will always catch them before they leave the factory's walls.
I am also suspicious, because the via oxidation thing was only uncovered by a GN source and then GN sent faulty CPU samples to a lab to be analyzed. Intel were then forced to admit that there was indeed a manufacturing problem, because the lab analysis would have proved the via oxidation claim to be true.
Yeah... no. If this is something that only impacted some of the early RPL material, then it is highly unlikely GN has an oxidized CPU.