Intel 13th and 14th gen Core i9 stability problems

Xebec

I'm curious what this forum thinks of the increasing reports of "13th and 14th gen Core i9 stability issues" that are appearing in the media. I've compiled a few links below talking about different aspects of this issue.

The overall issue manifests as spontaneous reboots, blue screens, or apps that crash to desktop at random intervals. Software that does decompression seems particularly hard hit, whether it's a game doing texture "stuff" or an application like 7-Zip. Intel has released updated power / voltage recommendations for motherboard vendors, but that doesn't appear to have addressed the issue.

Intel is also reported to be fielding a very high level of RMAs for these processors, specifically the i9 (highest clocked, highest core count) parts. The 12th gen does not appear to be affected, and the 13th and 14th gen i7s seem to be only mildly affected (not nearly as clearly as the i9). In 13th and 14th gen, the i7 is lower clocked and has 8 performance + 8 efficient cores (8+12 on the 14700K), while the i9s are 8+16.

Informational/source links are below. I've summarized take-aways for each link:

Game Developer(s) indicating that >80% of all reported crashes in their games occur on these specific processors:


Level1Techs: Uncovered a lot of data from game server hosts showing that servers are very unstable on the i9 processors, which is costing the hosts customers and causing them to drop the equipment for AMD. Game servers love single-threaded performance, which is why they've been using Core i9. Worth noting: they tend to use "W" (workstation) chipsets, which do not expose these processors to the very high power limits that consumer boards do. He also cited warranty costs going up 8x for servers based on Core i9 vs. a few months ago. L1T dove into a lot of publicly available logging information to verify the high failure rates.


GamersNexus: In a discussion with Level1Techs, Steve (owner of Gamers Nexus) indicated he thinks he knows what the problem is, and he thinks Intel does too, but he isn't saying more. They strongly implied the fix needs to be done at the hardware level.


Moore's Law is Dead: Cites several sources for increasing RMAs on these processors. Tom (owner) also stated that the problem is some combination of the # of cores on the ring bus (Core i9 is 8+16), and other things causing premature failure somewhere in the cache/memory functional area.


Igor's Lab: Quoting an Intel doc: "Failure Analysis (FA) of 13th and 14th Generation K SKU processors indicates a shift in minimum operating voltage on affected processors resulting from cumulative exposure to elevated core voltages. Intel® analysis has determined a confirmed contributing factor for this issue is elevated voltage input to the processor due to previous BIOS settings which allow the processor to operate at turbo frequencies and voltages even while the processor is at a high temperature. Previous generations of Intel® K SKU processors were less sensitive to these type of settings due to lower default operating voltage and frequency."

 
I think this is much bigger than people realise, because the crashes increase in frequency over time.
Could it be a problem with the so-called Intel 7 Ultra node which is used for Raptor Lake?
 
Never say never, I guess. And guilty until proven innocent is the way of the fab. When I first saw this surface, I thought this was just design pushing the redline on frequency too hard. However, current evidence makes it look doubtful that process is a root cause (although maybe it is a contributing factor). The FPGAs, ADL, EMR, SPR, and non-i9 RPLs don’t seem to have issues. The problems also seem to be frequency and wattage independent. If there was an excursion, it would be across more product SKUs and a lower percentage of parts. If the power was pushed too far, you would not see issues at low power. If there is some wonky process marginality, it would likely show up in more places, such as EMR, or even be far more prevalent with the i7s, since they share the same die. As far as I know, there haven’t been any issues with U/P/H RPL either.
 
The issue is mostly with the 8+16+1 die that is used by the i9 and i7 for 13th/14th gen; everything else is working fine.
 
We can't really assume that, because initially the i9s and i7s were working fine too. It's a problem that worsens with time and is exacerbated by running at high voltages. And btw, low-end 13 series parts are mostly Alder Lake, not Raptor Lake.
 
We can't really assume that, because initially the i9s and i7s were working fine too. It's a problem that worsens with time and is exacerbated by running at high voltages.
My understanding is that it is rare for the i7s to act up and that the overwhelming majority of problem CPUs are i9s. Combine the relatively small number of i7s reported (seems like 1-2 orders of magnitude fewer reported failures on the i7s vs the i9s) with the fact that Intel sells many times more i7s than i9s, and I think it is obvious that the failure rate is MUCH higher on the i9s than the i7s. As for voltage having an impact, watch the first video. Even at highly reduced voltages for a server config, the problem seems just as bad if not worse.
And btw, low-end 13 series parts are mostly Alder Lake, not Raptor Lake.
They may have the same cache config as ADL, but they still run on the same process vintage as the 8+16+1 die. If they didn't, there is no way they could hit the higher frequencies at the same or lower power with the same or more cores (depending on SKU) that they do. Since the issue only seems to be manifesting with that 8+16+1 die and not the 8+8+1, 6+0+1, 6+8+2, 2+8+2, or EMR dies running on the same process derivative, it does seem to be an issue specifically with that die.
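
To make the i7-versus-i9 argument above concrete, here is a back-of-the-envelope sketch. The numbers are placeholders chosen only to match the rough shape described in the post (about 10x fewer i7 reports, a few times more i7s sold); they are not measured data, and `relative_failure_rate` is a hypothetical helper.

```python
def relative_failure_rate(reports_a, units_a, reports_b, units_b):
    """Return how many times higher A's per-unit report rate is than B's."""
    rate_a = reports_a / units_a
    rate_b = reports_b / units_b
    return rate_a / rate_b


# Placeholder inputs, NOT real data: the i9 gets 10x the reports of the i7,
# while the i7 outsells the i9 by 3x.
i9_reports, i9_units = 1000, 100_000
i7_reports, i7_units = 100, 300_000

ratio = relative_failure_rate(i9_reports, i9_units, i7_reports, i7_units)
print(f"i9 per-unit report rate is about {ratio:.0f}x the i7 rate")
```

With these placeholder inputs the ratio is simply the report multiple times the sales multiple (10 x 3), which is why even the low end of the "1-2 orders of magnitude" estimate would imply a much larger per-unit gap.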
 
They are run on the same node as Alder Lake. There is even an article about the infamous 13400F: "This time the difference in the size of the dies and their native features is quite a bit smaller, but the manufacturing process is different, with stepping B0 being the more modern Intel 7 Ultra."
 
The Gamers Nexus and Level1Techs videos briefly talked about how Linux was occasionally dropping individual cores on the i9s and that this was (temporarily) regaining stability. Sometimes it's an E-core (cluster) or a P-core that can be removed to bring back stability to the CPU.
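
The videos are not quoted here on exactly how the cores were dropped, so the following is only a sketch of one way to take a suspect core offline by hand on Linux, using the kernel's CPU hotplug files under `/sys/devices/system/cpu`. The helper name `set_cpu_online` and the `root` parameter (added so the function can be pointed at a scratch directory) are assumptions for illustration.

```python
from pathlib import Path

# Standard location of the Linux CPU hotplug control files.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu, online, root=SYSFS_CPU_ROOT):
    """Write 1 or 0 to cpuN/online. Needs root privileges on a real system."""
    control = root / f"cpu{cpu}" / "online"
    if not control.exists():
        # Some CPUs (often cpu0) expose no 'online' file and cannot be offlined.
        raise FileNotFoundError(f"{control} not found; this CPU may not be hot-pluggable")
    control.write_text("1" if online else "0")
    return control
```

For example, `set_cpu_online(17, online=False)` would ask the kernel to stop scheduling work on logical CPU 17 (a hypothetical core number); the change does not persist across reboots.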

Just a comment on voltages: while the consumer boards definitely overvolted a bit too much when hitting 5.8, 6.0, and 6.2 GHz on 1-2 cores, the i9s (13900K, 14900K, 14900KS respectively) require something on the order of 1.5V out of the box.

The main difference between the server boards and consumer boards is that the total amperage draw is more limited, effectively reducing local hot spots. The actual voltage limits of the server boards could be above or below those of consumer boards: lower because no OC profiles are needed, but equal or higher if they want to ensure stability at ~6 GHz on 1-2 core workloads. Muddier still, servers could actually run hotter than consumer PCs because density requirements offset the lower maximum power draw.

..

If the problem is temperature, servers and consumers will probably fail at about the same rate. If it's about core counts, the i9 will fail first. If it's about total power draw (local hotspots), consumers should fail more than servers. If it's about voltage or frequency, then consumers and servers should fail at about the same rate.
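
The hypothesis-to-prediction reasoning above can be written down as a small lookup, which makes it easier to see which explanations a given observation leaves standing. This is only a sketch: the short labels are my own shorthand for the four cases in the post, and `consistent_hypotheses` is a hypothetical helper.

```python
# Expected failure pattern under each hypothesis, as stated in the post above.
PREDICTIONS = {
    "temperature": "servers ~= consumers",
    "core count": "i9 fails first",
    "total power draw (local hotspots)": "consumers > servers",
    "voltage or frequency": "servers ~= consumers",
}


def consistent_hypotheses(observation):
    """Return the hypotheses whose predicted pattern matches the observation."""
    return [name for name, predicted in PREDICTIONS.items() if predicted == observation]


print(consistent_hypotheses("servers ~= consumers"))
# prints ['temperature', 'voltage or frequency']
```

One thing the lookup makes visible is that a server-versus-consumer comparison on its own cannot separate the temperature case from the voltage or frequency case, since both predict the same pattern.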
 
They are run on the same node as Alder Lake. There is even an article about the infamous 13400F: "This time the difference in the size of the dies and their native features is quite a bit smaller, but the manufacturing process is different, with stepping B0 being the more modern Intel 7 Ultra."
Good find. Presumably mobile is all RPL, since the power consumption dropped so much for all SKUs going from 12th to 13th gen. I also wouldn't be shocked if, over time, more of the 13th gen and maybe all of the 14th gen mainstream CPUs moved to RPL silicon to give them that little extra. It is also worth noting that, depending on what changed on the process side, the CPUs can be on the new process with no stepping change, and all that would be missing is the small RPL design enhancements (which might be the root cause of the small perf and power deltas between different 13400Fs). Given the strict reliability requirements and higher operating temps, you would think that EMR qual would have found any issues with the changes made going from Intel 7 to Intel 7+, if they existed.
 
Wendell and I spoke about this in our latest podcast. He gave some updates: https://www.youtube.com/live/5KHCLBqRrnY?si=UP3igrCkjQEYQ4pm&t=2001

Since that podcast, we've spoken more, and I've had some thoughts. I put it in a twitter thread, but can also post here.

1. Been thinking about the root cause of the Intel CPU issues. It happens even on non-K/KS, even in stable environments. Wendell and I discussed privately about electromigration and degradation, but I'm starting to wonder if it's more physical than that.
2. The socket latch has long been a point of concern for thermals. Lots of pins, not a lot of uniform pressure. If ground pins aren't connecting, that increases issues even if the CPU still works due to resistance/impedance.
3. Not only that, but the physical warpage of the CPU over time. We were seeing overclockers having issues with the default socket latch on day one, causing motherboards and CPUs to bend.
4. If the silicon inside is feeling torque due to anticlastic deformation via the socket, that could cause non-regular issues long-term that are hard to pin down in software. A symptom of that could be increased electromigration, making the chip design not the cause, but the socket.
5. This means that the silicon in the chip could be experiencing deformation at a fundamental level. Power and data connections, if brittle enough, could increase resistance due to deformation. Normally this isn't an issue in regular use, but neither is it tested for in validation.
6. I should stress I have no data to back this up, aside from the seemingly unfocused nature of the errors affecting all systems, inc non-K and T, and even in underclocked environments focused on stability. The fact it's more of a thing now is a factor of time.
7. We're seeing more high-end CPUs because of the fact that those units typically need more mounting pressure, or might use boards without a mounting plate. The cooling requirements are higher, the current draw is higher, so any shear twist or torque will exacerbate over time.
8. This is one of the many reasons why we love sockets that apply equal pressure.

Simply put, if electromigration is happening, it's a symptom rather than a cause. It's on the low end SKUs as well, even stable ones.
Why not Alder (12th Gen) over Raptor? Amplified problem. 12th gen still appears in the data.
 
Do we have laptop HX data to compare? They don't have the socket issue and the die is the same, so if HX is not having issues but desktop is, the bending/pressure question can be cleared up.
 
2. The socket latch has long been a point of concern for thermals. Lots of pins, not a lot of uniform pressure. If ground pins aren't connecting, that increases issues even if the CPU still works due to resistance/impedance.
Supposing it is the socket that's causing the problems, what would Intel have to do to fix the problem?
 
Ian knows more than I ever will about the architecture and the potential causes and impacts. I think the challenge for Intel is going to be how to react, and the PR around that reaction.

In reality, Intel could ignore the whole thing and 99% of PC users will be unaffected or not know there is an issue. Move on to the next node, take the PR hit now, fix Arrow Lake, RMA bad units without pushback. "Stop driving in the rear view mirror." I think this is where Intel will go (and I think it is correct).

However, if it gets to the point where it IS impacting Arrow Lake sales and driving moves to AMD, then Intel will have to respond, and the reaction will be "too slow to respond" and severe brand impact. I know people who won't fly on a Boeing plane (true) because they swear that 1 out of 10 planes will crash (exaggeration).

@IanCutress Is there any chance Arrow Lake could have a similar issue (On N3 or 20A)?
 
I think they have an option for ARL LGA-1851 boards: https://x.com/jaykihn0/status/1808315008000143799
 
Intel could offer new mounting plates, though in reality only a small number of people will ever replace their socket.

AFAIK they're simply offering replacements to enterprise.
The problem is, Intel 7 is an expensive process. Back in 14nm, they could replace every chip and still make money (probably). This time around, not so much.
 