Intel 13th and 14th gen Core i9 stability problems

Just two comments - higher temperature makes chips more sensitive to voltage damage; they could also be adjusting the tables for temperature vs. voltage allowance.

Re: over voltage; it’s not completely clear if the chips are going ‘too high’ on voltages only or maybe aren’t requesting enough for a frequency and crashing at times too; 5.8-6.2 GHz requires a lot of voltage (depending on instructions). They’ve been recommending limiting power (amps) but that sort of implies a temperature vs voltage issue (above) that may not be tuned on both ends.

(AMD started the ‘very high voltage for max single frequency’ engineering on desktop with Zen 1 - which used 1.45V for 4.1 GHz on GloFo 14nm, while ~ 1.40V was already considered damaging long term even in the 32nm days).
 
So, EM or SM?
Intel also said that the vast majority of the issues are not even tangentially related to the oxidation issue. Even if we assume Intel is blatantly lying, EM and the via oxidation aren't the same thing. They are different phenomena. Arcing/dielectric breakdown is also a completely different issue from EM.
Stress migration might explain why lower voltage and frequencies don't fix the issue (at least not completely).
Intel's explanation is independent of frequency, which matches up with findings in the wild.
That said, the microcode fix Intel is going to release, might suggest that electromigration is more likely.
The problem isn't copper diffusing into the ILD though.
Are they gambling though? I mean, it is possible that reducing the max current and max voltage of such CPUs might reduce the number of failures quite consistently. But as many mentioned already, is it fine to get parts that might still fail sooner or later? Probably they would rather spread the blame over a longer timeframe for now; maybe they just can't afford to do the right thing immediately. And maybe they hope that the next gen of CPUs will buy them back full trust. If those "patched" CPUs last over 2 years, the hit they take is marginal in the end. Of course they risk losing their reputation completely if the problem explodes later and the fix is just delaying the inevitable. That's of course the worst case scenario. I don't think it's gonna happen. EM fits well imo. Probably the node is marginal and definitely not meant to be pushed as hard as they did with the i9 parts. Let's wait and see.
Intel doesn't claim that the problem is "juicing the i9", which makes sense because lowering frequency doesn't seem to fix the issue. What they claimed is that the CPUs were drawing more voltage than they were supposed to. As an example, let's say you need 1.65V to run stably at 6GHz and 1.2V to run stably at 4GHz. My understanding of the statement is that Intel claims that, due to a bug in the microcode, instead of drawing 1.65V at 6GHz the i9 is trying to draw, let's say, 2V, and that at 4GHz it is also trying to draw 2V. The claim seems to be that the problem is not the 1.65V stock behaviour but the unintended behaviour of asking the socket for more voltage than it needs. This seems to provide a reasonable explanation for why lower TDP and lower frequency RPL-W CPUs seem to have issues (because they are drawing erroneous voltages). If this is indeed the root cause, then I don't see why fixing the microcode so it only draws the intended voltage at a given frequency wouldn't stop a CPU from continuing to harm itself.
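To put rough numbers on the example above, here is a minimal sketch that computes how far a flat 2V request overshoots the intended voltage at each frequency. All figures are the hypothetical ones from the post (1.65V at 6GHz, 1.2V at 4GHz, 2V requested); they are not measurements or Intel specifications, and the names are made up for illustration.

```python
# Hypothetical numbers from the example in the post; illustrative only,
# not measured values or Intel specifications.
INTENDED_V = {6.0: 1.65, 4.0: 1.20}  # GHz -> volts the core is assumed to need
BUGGY_REQUEST_V = 2.0                 # volts assumed to be requested at any frequency


def excess_voltage(freq_ghz: float, requested_v: float = BUGGY_REQUEST_V) -> tuple[float, float]:
    """Return (excess in volts, excess as a fraction of the intended voltage)."""
    intended = INTENDED_V[freq_ghz]
    excess = requested_v - intended
    return excess, excess / intended


for freq in sorted(INTENDED_V, reverse=True):
    volts, frac = excess_voltage(freq)
    print(f"{freq:.1f} GHz: +{volts:.2f} V over intended ({frac:.0%})")
```

Under these assumed numbers the relative overshoot is larger at the lower frequency, which is consistent with the argument that lower clocked parts would not be spared by such a bug.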
 
GN says Intel's biggest customer had a 10 - 25% failure rate on all their Raptor Lake PCs, out of a total of 6 million PCs. The microcode fixes the problem, assuming that damage hasn't already been done by the voltage overload. I would guess that most customers will play it safe and just return the product, rather than take the chance that the CPU might already be permanently damaged. Let's say 10% of all Raptor Lake CPUs ever sold are returned - that's going to be a big write-off on their next earnings call.
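For scale, a quick back-of-the-envelope sketch of what a 10 - 25% failure rate means on a 6 million PC fleet. The two inputs are the figures quoted above; the function name is just for illustration.

```python
TOTAL_PCS = 6_000_000              # fleet size quoted in the post
FAILURE_RATE_RANGE = (0.10, 0.25)  # 10% to 25%, as quoted


def affected_units(total: int, rate: float) -> int:
    """Number of machines affected at a given failure rate."""
    return round(total * rate)


low, high = (affected_units(TOTAL_PCS, r) for r in FAILURE_RATE_RANGE)
print(f"{low:,} to {high:,} affected PCs")
```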
 
Intel doesn't claim that the problem is "juicing the i9", which makes sense because lowering frequency doesn't seem to fix the issue. What they claimed is that the CPUs were drawing more voltage than they were supposed to. As an example, let's say you need 1.65V to run stably at 6GHz and 1.2V to run stably at 4GHz. My understanding of the statement is that Intel claims that, due to a bug in the microcode, instead of drawing 1.65V at 6GHz the i9 is trying to draw, let's say, 2V, and that at 4GHz it is also trying to draw 2V. The claim seems to be that the problem is not the 1.65V stock behaviour but the unintended behaviour of asking the socket for more voltage than it needs. This seems to provide a reasonable explanation for why lower TDP and lower frequency RPL-W CPUs seem to have issues (because they are drawing erroneous voltages). If this is indeed the root cause, then I don't see why fixing the microcode so it only draws the intended voltage at a given frequency wouldn't stop a CPU from continuing to harm itself.
i9 CPUs have a way higher "failure" rate than i7 or i5 though. So, if the CPU is just behaving erratically in terms of voltage/current draw, all parts should be affected in almost the same way (and as far as I know, that is not the case). And we also know that they sold way more i7s than i9s. So, something is clearly weird/suspicious here in terms of numbers. Also, it wouldn't explain why 12th generation seems unaffected, or at least affected only very marginally compared to 13th and 14th. Let's be clear. I'm not saying Intel is selling faulty chips, just that the process might be marginal if pushed so hard, like they did for i9 K CPUs (and it doesn't matter if it wasn't intentional here, it is still a big mistake). In fact, non-K parts (so with locked overclocking capabilities) seem to be working fine so far. Finally, the biggest concern right now is mainly about OEMs. Their customers have already used CPUs for many months with "wrong settings" at the very least, and potentially they are all already degraded. The fix might be OK for the new ones, but there is definitely no way to guarantee that the used ones are gonna be fine long term. And no matter if the EM is caused by the marginal process or the microcode, it is still an Intel issue and they are accountable for it. I hate monopolies, so I truly hope they can solve this issue very quickly. But they haven’t been very transparent about it so far, and that's very bad.
 
Intel also said that the vast majority of the issues are not even tangentially related to the oxidation issue. Even if we assume Intel is blatantly lying, EM and the via oxidation aren't the same thing. They are different phenomena. Arcing/dielectric breakdown is also a completely different issue from EM.
Intel never said that. Intel said they fixed the via problem in "2023". How many 13th gen CPUs were shipped between 2022 and Dec. 2023? Millions and millions.

If it was fixed in early 2023 -- Intel should say that. (but they didn't)

One of Intel's customers said they were still seeing the same issue in April of 2024.
 
Given how Intel has handled (screwed over their own enthusiast customer base) this issue, I'm not planning on ever purchasing another Intel CPU again.

This is absolutely one of the worst ways Intel could respond to this absolute mess they dug themselves into. Instead of apologizing and trying to make things right, they pretend it's not an issue, stall, act like it's not a big deal and now try to sweep things under the rug while doing absolutely the bare minimum to make things right. 🤬😡😠
 
As some already mentioned, of course Intel knows what to do; the problem is they can't do it. If they recall 13th and 14th gen CPUs, they can only replace those with other problematic CPUs. Money back is not possible either, because the affected user can't just get a competitor CPU and call it a day. Motherboards are different too. And Intel can't replace the whole hardware without going bankrupt (nobody is really too big to fail, nobody). So, they are living on their past reputation and taking the hit. What is absolutely madness, and makes the risk of lawsuits or class actions extremely high, is the fact that they didn't stop selling those CPUs. This is simply unacceptable. Whoever made this call (no action is a decision nevertheless) is burned imo.
I'm also surprised this isn't the main topic of the forum right now. 😳
 
1000034602.jpg

Credits to the YouTube channel Moore's Law is Dead.
 
View attachment 2138
Credits to the YouTube channel Moore's Law is Dead.
I watched the video and when it got to this slide I had to pause for a minute because I was laughing. Originally I thought maybe there is a 10-20% chance he has actual sources but they go way over his head, but now I am pretty confident he just makes stuff up or scrapes leaks from elsewhere on the internet. The best part for me is the escalating levels of absurdity, it is so good :ROFLMAO:.

Point 1: EMR doesn't have a ring bus, so by definition it can't get cooked. Also the cores run well under 1.5V, heck well under 1V, even when single core boosting. It is also funny seeing him slyly backpedal from "Bartlett Lake is a client part" to "it was never a client part" after folks online said he was dumb and that it was an NEX part. If the problem was the RPL/ADL ring bus, then if anything that problem should replicate in MTL, since Intel claims the MTL CPU is a slightly tweaked version of RPL.

Point 2: Hard to imagine that a 10 year stress test wouldn't unearth the ringbus "cooking itself" at 1.5V.

Point 3: I don't see how running priority material through the fab is part of the problem. The HVAC going out is really funny to me. The tools have their own environments separate from the fab, and for many steps of the process the wafers are in a vacuum (like PE-xVD, Etch, etc). Wafers are exposed to much hotter temps during wafer processing than an Arizona spring could ever reach. BOEs like the RCA clean are specifically in the flow to remove oxides. When in storage, wafers sit in N2 precisely to prevent oxide films from growing. Then we get even more comical with the claim that Keyvan had to fly to Arizona and sort through every wafer. If we run with this story, what would Keyvan know about which wafers can be saved or not? That would be something only defmet and e-test folks would do and know. The image of Keyvan sitting on the floor looking at each wafer in a clamshell with a mountain of wafers behind him does put a grin on my face. Where I completely lost it was an Intel 7 wafer costing as much as a Model X :ROFLMAO:. I'm snickering just typing that an Intel 7 wafer costs 3-4x what TSMC might be selling N2 wafers for in a couple of years. Then we end on the last point: "I don't know if the HVAC system that keeps me cool and is causing oxidation is fixed yet, but it probably is". Just *chef's kiss*, it is all wonderful.

On a less jolly note:
Given how Intel has handled (screwed over their own enthusiast customer base) this issue, I'm not planning on ever purchasing another Intel CPU again.

This is absolutely one of the worst ways Intel could respond to this absolute mess they dug themselves into. Instead of apologizing and trying to make things right, they pretend it's not an issue, stall, act like it's not a big deal and now try to sweep things under the rug while doing absolutely the bare minimum to make things right. 🤬😡😠
Agreed the way intel is handling this is terrible and they have opened themselves up for folks wearing tin foil hats to be more credible than themselves. I think intel needs to earn your trust if they want you or other DIYers' money back. Many folks in that circle seem to already "just buy AMD" and this has given a material reason for people to do this.
 
EMR chips are indeed fine. So your point 1 actually supports his claims.
Point 2. Stress tests are fine for sure, but it doesn't mean you can't have defective/marginal CPUs later. You just test a very early and small sample size when you qualify the process/device.
If you rush things as they did, the risk of missing/skipping a few process corners is higher. The real concern here is the binning instead. If you lower the sort "bar" to get more i9 out of i7 chips (to increase your margins), you could put your shipped device under a bigger pressure than you thought/planned if it is somehow marginal. And so far, they definitely seem marginal.
Now, I didn't care about the given root cause of the excursion (although the environment could definitely play a role too, there is nothing to laugh at), but I was more interested in the Fab and the timeframe, since Intel hasn't disclosed this information yet. Now, we sure can't tell whether that part is true, but it sounds believable at least. The other comment about the wafer cost is simply trivial. 5k, 10k or 15k per wafer, do you think it would matter when you have to decide whether to scrap your material or not? It is a crazy amount of money for many months of Fab production. I bet at Intel nobody laughed 😅
The Intel response so far has been worrisome to say the least. Bad RMA management, no accountability, they tried to blame MoBo manufacturers, then the users, then they released a BIOS fix that is not a real fix (it is just a performance nerf), now they promised a new fix, finally acknowledging the issue (their issue), but never stopped selling the unpatched CPUs anyway.
Perhaps you laugh, but at Intel, they definitely cry, absolutely.
 
If the thermal velocity boost is limited by 100 C instead of 90 C, because you wanted to beat your competitor at any cost, the likelihood of something going wrong is now definitely higher. Process and equipment might be the same; the devices are not, though.
 
EMR chips are indeed fine. So your point 1 actually supports his claims.
He said it was unlikely to be impacted materially. If he had any clue of what he was talking about he would know that a flaw with the ring bus is completely incompatible with EMR.
Point 2. Stress tests are fine for sure, but it doesn't mean you can't have defective/marginal CPUs later. You just test a very early and small sample size when you qualify the process/device.
Every CPU gets burned in, not some small sample size. With how common the issue seems and the allegation that the problem is a design issue that impacts every RPL die, the probability of not seeing that is basically 0%.
If you rush things as they did, the risk of missing/skipping a few process corners is higher. The real concern here is the binning instead. If you lower the sort "bar" to get more i9 out of i7 chips (to increase your margins), you could put your shipped device under a bigger pressure than you thought/planned if it is somehow marginal. And so far, they definitely seem marginal.
Getting more i9s out of i7s would only matter if Intel actually sold many of them. Desktop is like 20% of the CPU market and my bet is that, of that 20%, less than 10% are i9s. If the problem was binning too many i7s as i9s, why does the desktop i7 also seem to have so many problems?
Now, I didn't care about the given root cause of the excursion (although the environment could definitely play a role too, there is nothing to laugh at),
It is a laughing matter when the information in the video is clearly made up. I explained why that quote 3 is absurd; if you don't want to think critically about that information, that is your prerogative. Personally I think I would notice the temperature in my fab being like 20 deg F hotter than normal. So if you want to believe that a totally real "source" couldn't even tell you if the HVAC system is working again (over a year after the alleged outage), that is your loss, not mine. He could have made up literally any other cause for oxidation and it would be far more believable than what he rolled with.
but I was more interested in the Fab and the timeframe, since Intel hasn't disclosed this information yet. Now, we sure can't tell whether that part is true, but it sounds believable at least. The other comment about the wafer cost is simply trivial. 5k, 10k or 15k per wafer, do you think it would matter when you have to decide whether to scrap your material or not? It is a crazy amount of money for many months of Fab production. I bet at Intel nobody laughed 😅
I know one who is laughing right now. To ground your understanding anyone who says an N5 wafer is priced near 20k has actually no clue what they are talking about or any inside information since the actual price is closer to half of that. For his "source" to claim an intel 7 wafer costs 70k proves without a shadow of doubt that there is no source. As I stated the comments got even more absurd and detached from reality as he went on.
The Intel response so far has been worrisome to say the least. Bad RMA management, no accountability, they tried to blame MoBo manufacturers, then the users, then they released a BIOS fix that is not a real fix (it is just a performance nerf), now they promised a new fix, finally acknowledging the issue (their issue), but never stopped selling the unpatched CPUs anyway.
Perhaps you laugh, but at Intel, they definitely cry, absolutely.
The 13th/14th gen issue is real and you are right, it isn't a laughing matter. However, this ad-lib of a video is nothing but a laughing matter. When the theory has so many logical holes and has information that is obviously false, I have no reason to take anything in that video seriously.
 
Point 1: EMR doesn't have a ring bus, so by definition it can't get cooked. Also the cores run well under 1.5V, heck well under 1V, even when single core boosting.

I had to look this up, some examples of VIDs from before the updated BIOSes started flying around.

(Post) 14900KS - 1.533V for 62X multiplier (up to 2 cores), 1.413V for 59X
(Comments) - User’s 14900K has a VID of 1.479V at 6.0 GHz; but was seeing 1.545V out of the box
(Comments) - Some 14900K (not KS) have a specified 1.503V VID at 6.0 GHz; others are lower


Review of 13900K “As you can see the 13900K is using quite a bit more voltage than 12900K”:
1.3V nominally, spikes to 1.37V at the end of the test run

So at least some of the highest end 14th gen i9s are calling for 1.5V. Motherboards can often apply more than this (see ASUS cooking 7800X3Ds), and transient spikes from load changes and VRM response can also cause spiking, so I *think* 1.5V might be more common than it should be?

Surprisingly, Intel’s spec also says the maximum voltage allowed is 1.72V for 13th gen: https://community.intel.com/t5/Gami...xt=The Operating Voltage for the,1.2V to 1.5V.

(I’m still curious about these high voltages being “ok” - Intel 32nm definitely saw degradation at 1.45V, and 1.40V was considered a bit too high for 24/7 usage).
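To make the comparison easier to eyeball, here is a small sketch that checks the readings listed above against the 1.5V figure and the 1.72V maximum. The voltages are copied from this post; the shortened labels and the helper name are my own, and this says nothing about what any other chip actually requests.

```python
# VID readings as reported in the post above, in volts. Labels are shortened.
REPORTED_VIDS = {
    "14900KS @ 62x": 1.533,
    "14900KS @ 59x": 1.413,
    "14900K @ 6.0 GHz (user VID)": 1.479,
    "14900K @ 6.0 GHz (seen out of the box)": 1.545,
    "14900K @ 6.0 GHz (some samples)": 1.503,
    "13900K review (nominal)": 1.30,
    "13900K review (spike)": 1.37,
}
THRESHOLD_V = 1.50  # the "1.5V" figure under discussion
SPEC_MAX_V = 1.72   # maximum quoted for 13th gen in the linked Intel post


def at_or_above(readings: dict[str, float], threshold: float) -> list[str]:
    """Labels of readings at or above the threshold, sorted alphabetically."""
    return sorted(name for name, volts in readings.items() if volts >= threshold)


high = at_or_above(REPORTED_VIDS, THRESHOLD_V)
headroom = SPEC_MAX_V - max(REPORTED_VIDS.values())
print(f"{len(high)} of {len(REPORTED_VIDS)} readings are at or above {THRESHOLD_V} V")
print(f"Headroom from highest reading to quoted max: {headroom:.3f} V")
```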
 
I can't help but comment on the entire "ring bus" discussion. First of all, Intel does not refer to the intra-CPU communication rings as buses, because they're not. A bus is a shared interconnect which supports one device sending one message at a time, and usually all devices on the bus can see every message, so buses have sometimes been used for message broadcast purposes. The bisection bandwidth of a bus is equal to the throughput available on the one shared link. There are several strategies for reserving and releasing buses for more efficient communications (as PCI-X did for PCI), and it is a long discussion which doesn't add any value to this ring discussion.

A ring is an interconnect where each end point device has two interfaces, one to the physical predecessor and one to the successor. Since individual rings work in only one direction, each ring interface has the ability to regulate the ring traffic that goes through it. So, if you have a ring with four devices on it, you can theoretically have four different messages (or message segments) on the ring simultaneously. I don't know what the Intel CPU ring implementation looks like, but some devices employing rings in the past have used two rings, one which functions clockwise and the other counterclockwise. Think of a ring as a switched interconnect with a two port switch in each device.
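The bus versus ring difference described above can be sketched with a toy model. This is only an illustration under simplifying assumptions of my own (one message per link per cycle, a single clockwise ring, a bus that serializes every transfer); it is not how Intel's rings are implemented, and the function names are hypothetical.

```python
Message = tuple[int, int]  # (source node, destination node)


def bus_cycles(messages: list[Message]) -> int:
    """Shared bus: one message at a time, so transfers serialize."""
    return sum(1 for src, dst in messages if src != dst)


def ring_cycles(messages: list[Message], n_nodes: int) -> int:
    """Clockwise ring: link i goes from node i to node (i + 1) % n_nodes,
    and each link moves at most one message per cycle."""
    in_flight = [(src, dst) for src, dst in messages if src != dst]
    cycles = 0
    while in_flight:
        cycles += 1
        used_links: set[int] = set()
        remaining = []
        for pos, dst in in_flight:
            if pos not in used_links:  # outgoing link of this node is free
                used_links.add(pos)
                pos = (pos + 1) % n_nodes
            if pos != dst:
                remaining.append((pos, dst))
        in_flight = remaining
    return cycles


# Four devices, each sending to its clockwise neighbour.
neighbours = [(0, 1), (1, 2), (2, 3), (3, 0)]
print("bus :", bus_cycles(neighbours), "cycles")
print("ring:", ring_cycles(neighbours, n_nodes=4), "cycles")
```

In this toy model the four neighbour transfers overlap on the ring but serialize on the bus. Multi-hop traffic that shares a link still queues on the ring, so the advantage depends on the traffic pattern.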

So, my takeaway from these "ring bus" discussions is that these people really don't know what they're talking about, making it difficult to take what they say seriously.
 
I know one who is laughing right now. To ground your understanding anyone who says an N5 wafer is priced near 20k has actually no clue what they are talking about or any inside information since the actual price is closer to half of that. For his "source" to claim an intel 7 wafer costs 70k proves without a shadow of doubt that there is no source. As I stated the comments got even more absurd and detached from reality as he went on.

What, 70K per Intel 7 wafer? 🤣 It is more like somewhere between N6 and N4 pricing, IIRC. This guy sometimes makes up unbelievable stories.
 