> the fact that SRF can frequently beat Zen4 EPYC efficiency despite the inferior core arch

Huh? Benchmarks show Genoa soundly beating Sapphire Rapids with lower power consumption: https://www.phoronix.com/review/intel-xeon-platinum-8490h/14
> Huh? Benchmarks show Genoa soundly beating Sapphire Rapids with lower power consumption: https://www.phoronix.com/review/intel-xeon-platinum-8490h/14

Yes, and the 144-core part has in one gen closed the gap to almost nothing. Workloads where Intel wins on perf per watt are now very common. Sure, when Turin and Turin Dense come out late this year they will widen the gap again, but not to what AMD used to enjoy. What happens when GNR and 288-core SRF come out? Once the dust settles, my guess is that Intel is less than one gen behind in DC now. Who would have thought process leadership mattered?
> what is SRAM cell size on Intel 3

Since Intel didn't mention any new SRAM, I assume the uHD 6T bitcell is the same as Intel 4's (0.0240 um^2 if memory serves). N5's is something like 0.021 um^2 (again, if memory serves). Funnily enough, that SRAM gap between the two is very similar to the Intel 7 vs. N7 bitcell-area gap. For their part, N3 and N3E have less dense bitcells than N5, but I don't remember their values. Bitcell size isn't the whole story, though: TSMC reduced the percentage of the array used by periphery logic, allowing their larger bitcell to roughly match N5 in MB/mm^2.
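To make that last point concrete, here is a toy MB/mm^2 calculation. The bitcell areas are the ones quoted above, but the array-efficiency fractions are purely hypothetical placeholders (the real periphery overheads aren't public):

```python
# Toy model: macro density = bitcell density x fraction of the macro that is bitcells.
def sram_density_mb_per_mm2(bitcell_um2: float, array_efficiency: float) -> float:
    """MB/mm^2 for a 6T SRAM macro; array_efficiency is the bitcell fraction."""
    bits_per_mm2 = 1e6 / bitcell_um2             # 1 mm^2 = 1e6 um^2, 1 bit per 6T cell
    return bits_per_mm2 * array_efficiency / (8 * 1024 ** 2)

# Hypothetical efficiencies, chosen only to illustrate the effect described above:
print(sram_density_mb_per_mm2(0.021, 0.70))   # smaller cell, more periphery  -> ~3.97 MB/mm^2
print(sram_density_mb_per_mm2(0.024, 0.80))   # ~14% larger cell, leaner periphery -> ~3.97 MB/mm^2
```

With those (made-up) efficiencies the larger bitcell lands at essentially the same macro density, which is the mechanism being described.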
> Yes, and the 144-core part has in one gen closed the gap to almost nothing.

It's still 19%, not exactly nothing. And we are talking about AMD's previous-year Bergamo versus Intel's just-released Sierra Forest.
> It's still 19%, not exactly nothing. And we are talking about AMD's previous-year Bergamo versus Intel's just-released Sierra Forest.

Good catch, Lefty! I didn't expect AMD's 2P platform to make such a large relative difference; it shifted efficiency a lot from the 1P example. As for the release date, that isn't lost on me. The "problem" as I see it is that Turin Dense is not coming out for a little while yet. 288c SRF could well close the gap in 2P and open up a 1P gap with Bergamo (making the safe assumption that the AP socket isn't using double the power). On top of that, it will be sandwiched by CWF next year. 19% is not nothing, but that doesn't really change the fact that the gap is rapidly narrowing. Methinks AMD needs to work on shortening the time between gens if they want to go back to the Milan vs. Ice Lake days (especially if Intel takes the lead from TSMC, even by a little bit, since that would leave AMD very far behind on the process tech front). But maybe they will be forced to start being on TSMC's leading edge.
Geometric mean of all test results:
EPYC 9754 2P = 5905.56
Xeon 6780E 2P = 4233.57

CPU power consumption (W):
EPYC 9754 2P = 375.51
Xeon 6780E 2P = 321.01

Perf/Watt:
EPYC 9754 2P = 15.72 points/W
Xeon 6780E 2P = 13.19 points/W

AMD lead over Intel = 19.2%
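For reference, the arithmetic behind those numbers is just the geomean score divided by average CPU power; a minimal sketch using the figures quoted above:

```python
# Perf/Watt and relative lead from the Phoronix figures quoted above.
def perf_per_watt(geomean: float, avg_power_w: float) -> float:
    """Points per watt from a benchmark geomean and average CPU power."""
    return geomean / avg_power_w

epyc = perf_per_watt(5905.56, 375.51)   # ~15.73 points/W
xeon = perf_per_watt(4233.57, 321.01)   # ~13.19 points/W
print(f"AMD lead: {(epyc / xeon - 1) * 100:.1f}%")   # ~19.2%
```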
> I agree, though Intel may also actually have better (more efficient communication between dice) packaging tech for SRF vs. Zen 4 EPYC. Intel also seems to have more dedicated accelerators on its server chips than AMD these days, which can make a difference in HPC.

Very true. On top of that, AMD lacks a coherent shared cache, so that could have hurt perf in some benchmarks. In AMD's favor you have the E-cores lacking some of the P-core accelerators, AMD's better 2P system, no AVX-512 on SRF, and Intel's cache coherency meaning Intel will have more die-to-die communication during the task than AMD. Using chip comparisons to talk about process is not a great metric; I was more using it as one data point in support of the argument for Intel 3's ferocity. But I think I will leave it there, since, one, I am not a chip designer, and two, I am derailing the thread.
> Yes, and the 144-core part has in one gen closed the gap to almost nothing. Workloads where Intel wins on perf per watt are now very common. Sure, when Turin and Turin Dense come out late this year they will widen the gap again, but not to what AMD used to enjoy. What happens when GNR and 288-core SRF come out? Once the dust settles, my guess is that Intel is less than one gen behind in DC now. Who would have thought process leadership mattered?

There is also specific acceleration like AI, which Intel is good at and has good SW support for; they have a bunch of accelerators in their CPUs to speed up specific tasks.
Looking at the Phoronix geomean for the 9754 vs. the 6780E, AMD has a 3% PPW lead over Intel, and against the 6766E, Intel has a 1.05% PPW lead over AMD.
> There is also specific acceleration like AI, which Intel is good at and has good SW support for; they have a bunch of accelerators in their CPUs to speed up specific tasks.

Nobody uses CPUs for AI, even for inference. An H100 or MI300 will be an order of magnitude faster than any CPU-based AI.
> Nobody uses CPUs for AI, even for inference. An H100 or MI300 will be an order of magnitude faster than any CPU-based AI.

CPUs are used for inferencing, not for training: https://www.techpowerup.com/319880/google-cpus-are-leading-ai-inference-workloads-not-gpus
> Would love his opinion. My less discerning eye says this sits squarely between N3E and N4P on a PPA basis. HD logic density is the same as N4P. HP density is much better than N4P and very close to N3/E. Both TSMC nodes likely win by a very wide margin when it comes to the characteristics mobile APs like (such as switching energy). And on the other side of things, Intel 4/3 have a MIM cap that is generations ahead of all other logic firms. TSMC also has something like half a node of an SRAM density lead with the N5/N3/N3E families. Considering that Intel 4 perf/watt is seemingly splitting hairs with N4P in HPC scenarios, and that SRF can frequently beat Zen 4 EPYC efficiency despite the inferior core arch, I think it is fair to say Intel 3 far exceeds N4P in that metric. Beyond that, it is hard to say if it is at the level of N3E/P. Considering that Intel said the leader's best node in 2024 (which presumably means N3E) was about equal to Intel 3 for perf/watt, I think it is safe to assume Intel 3 at the very least isn't better.

I'd be surprised if AMD don't mix and match HD and HP cells in different blocks as needed -- why wouldn't they?
The one oddity is the HD logic lib performance. I would not be shocked if you told me that little beast of a transistor had more performance than the minimum-sheet-width N2 transistor. Of course you would have higher power consumption and area, but I can't stop marveling at how fast that 210h is. I mean, the N3E 2-1 is, by TSMC's numbers, the same speed as N4P's HD lib at iso-power. Versus the N4P HP lib it would be slower, but that is totally fine (and expected) because you are getting lower area and less power. For the Intel 3 210h to be that much faster than the Intel 4 240h at the same power, while effective device width is reduced by 33%, is simply insane. It's a shame Intel CPU designers seem to rarely use HD libs. In a funny way, it seems like AMD's products (which seem to only use HD cells, but with uLVT/MIM/overdrive to hit high frequencies instead of relying on tall cells) would benefit from Intel 3 more than Intel's own products do.
> So, that's denser than what was estimated. According to this: https://semiwiki.com/semiconductor-manufacturers/intel/314047-intel-4-presented-at-vlsi/ the metal pitch was meant to be 45 nm.

The M2P track pitch is larger than the minimum M0 pitch. But the cell height for Intel 4 stayed the same as TSMC N7 (240 nm). CGP is a little tighter than TSMC N5 (51 nm).
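To put numbers on the M2P-vs-M0 point: the cell height comes out as a whole number of M0 tracks but not of M2 tracks. A quick check, assuming the commonly reported ~30 nm minimum M0 pitch for Intel 4 (the 240 nm height and the 45 nm "metal pitch" are the figures from the posts above):

```python
# Cell height expressed in routing tracks against each metal pitch.
cell_height_nm = 240
m0_pitch_nm = 30    # assumption: reported minimum M0 pitch for Intel 4
m2_pitch_nm = 45    # the "metal pitch" from the linked SemiWiki article

print(cell_height_nm / m0_pitch_nm)   # 8.0  -> a clean 8-track cell on M0
print(cell_height_nm / m2_pitch_nm)   # 5.33 -> non-integer, so M2P isn't what sets cell height
```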
> CPUs are used for inferencing, not for training: https://www.techpowerup.com/319880/google-cpus-are-leading-ai-inference-workloads-not-gpus

Yes, for small models you can use CPUs, but what use are small models? The AI that's earning the money is based on large models (ChatGPT, etc.).
For small models, CPUs are superior in cost and good enough; newer CPUs also have matrix-math units in them.
> I'd be surprised if AMD don't mix and match HD and HP cells in different blocks as needed -- why wouldn't they?

That would have been my thought too. But looking at teardowns, I remember being perplexed that their N5P parts only seemed to have the short (210h-equivalent) cell in the core area. Maybe the die-to-die PHY would be different? The other babble about VTs and overdrive was mostly me trying to rationalize how they get the Fmax they do while only using the short cell. Intel mentioned that RC is now over half of the frequency response. Maybe this is just due to Intel 7 using an inferior metallization scheme? Or maybe it is the high-mobility PMOS that carries the day?
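On the "RC is now over half of the frequency response" remark, a back-of-envelope distributed-RC estimate shows why minimum-pitch wires dominate; every number below is an illustrative assumption, not Intel or AMD data:

```python
# Elmore delay of a distributed RC wire: t ~= 0.38 * R_total * C_total.
rho_ohm_m = 1.9e-8        # effective Cu resistivity; barriers push this up at small widths
width_m, thickness_m = 20e-9, 40e-9   # illustrative minimum-pitch wire cross-section
cap_f_per_um = 0.2e-15    # typical order of magnitude for wire capacitance per um
length_um = 1000.0        # a 1 mm unrepeated route

r_total = rho_ohm_m * (length_um * 1e-6) / (width_m * thickness_m)  # ~23.8 kOhm
c_total = cap_f_per_um * length_um                                  # ~200 fF
print(f"delay ~= {0.38 * r_total * c_total * 1e9:.2f} ns")          # ~1.81 ns
```

Halving the wire cross-section roughly doubles R while C stays the same order of magnitude, which is why an inferior metallization scheme shows up so directly in Fmax.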
> Bear in mind that the mixed-height FinFlex libraries also have tall multi-row cells for more complex cells like multi-bit flip-flops and complex multi-input gates. For example, the 2-1 fin library (M143) is alternating rows of 1-fin (H117, non-critical paths) and 2-fin (H169) cells, but there are also dual-height H286 cells (1-fin row + 2-fin row), triple-height H403 cells (2x 1-fin + 2-fin), and H455 cells (2x 2-fin + 1-fin). I'm pretty sure the 3-2 fin library has the same kind of mix (H169 + H221?), but faster and higher power.

Other than the alternating cell heights within a block, that isn't a FinFlex-exclusive thing, no? And only very simple devices like a NAND would ever be the quoted short height, yes?
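The multi-row heights in that quote are just sums of the two row heights; a quick sanity check using only the numbers quoted:

```python
# FinFlex-style mixed library: multi-row cell heights from the two row heights (nm).
h_1fin, h_2fin = 117, 169

print((h_1fin + h_2fin) / 2)   # 143.0 -> the "M143" mixed-row average
print(h_1fin + h_2fin)         # 286   -> dual-height cell (H286)
print(2 * h_1fin + h_2fin)     # 403   -> triple-height, 2x 1-fin + 2-fin (H403)
print(2 * h_2fin + h_1fin)     # 455   -> triple-height, 2x 2-fin + 1-fin (H455)
```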
> Yes, for small models you can use CPUs, but what use are small models? The AI that's earning the money is based on large models (ChatGPT, etc.).

Yes, small models for basic DLRM/NLP things are good enough to run on a CPU; not everything needs that horsepower. Also, BTW, a 128-core Intel GNR will have TOPS roughly equal to an Nvidia A100: 2048 INT8 ops/clk at a 3 GHz clock across 128 cores roughly equals 786 TOPS, so it's not that bad for some larger models as well.
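The TOPS arithmetic checks out, taking the poster's 2048 INT8 ops/clk/core and a sustained 3 GHz as given (actual AMX throughput and clocks under load may differ):

```python
# Peak INT8 throughput from per-core ops/cycle, clock, and core count.
ops_per_clk_per_core = 2048   # assumed INT8 ops per cycle per core (from the post)
clock_hz = 3.0e9
cores = 128

tops = ops_per_clk_per_core * clock_hz * cores / 1e12
print(f"{tops:.1f} TOPS")     # 786.4 TOPS
```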
> Yes, small models for basic DLRM/NLP things are good enough to run on a CPU; not everything needs that horsepower. Also, BTW, a 128-core Intel GNR will have TOPS roughly equal to an Nvidia A100: 2048 INT8 ops/clk at a 3 GHz clock across 128 cores roughly equals 786 TOPS, so it's not that bad for some larger models as well.

Bandwidth is the real bottleneck for large AI models. If you don't have HBM to feed the cores, it's useless. Also, compare the price and power draw of a 128-core Granite Rapids versus an A100.
Is there a takeaway on how this compares to TSMC N3? Is @Scotten Jones doing an article on VLSI?
> Bandwidth is the real bottleneck for large AI models. If you don't have HBM to feed the cores, it's useless. Also, compare the price and power draw of a 128-core Granite Rapids versus an A100.

12 channels of MRDIMMs at up to 8800 MT/s is roughly 844 GB/s. But yeah, I agree additional HBM, like the Xeon Max HBM in the Aurora supercomputer, would be nice to have alongside this.
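The ~844 GB/s figure follows directly, assuming a standard 64-bit (8-byte) DDR5 channel:

```python
# Aggregate memory bandwidth from channel count, transfer rate, and bus width.
channels = 12
transfers_per_s = 8.8e9       # 8800 MT/s per channel
bytes_per_transfer = 8        # 64-bit DDR5 channel

print(channels * transfers_per_s * bytes_per_transfer / 1e9)   # 844.8 GB/s
```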
> But it looks like there is no significant increase in drive current at a given Vdd based on those Id-Vd curves? My interpretation was that the width had decreased, so a 2.5x decrease in leakage current per width and a 2x decrease in width. Not completely clear though.

You cannot tell from those curves; you need the Id vs. Vg at the right Vds. And a 2.5x value on a log scale can be difficult to see. Consider that they claim an improvement in SS, so that helps lower the Ioff at a given Vdd.
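To see why an SS improvement helps Ioff at a given Vdd: subthreshold leakage scales roughly as 10^(-Vth/SS), so even a few mV/dec matters on a log scale. A toy model with illustrative numbers (none are Intel's actual values):

```python
# Ioff ~ I0 * 10^(-Vth/SS): a steeper subthreshold slope (smaller SS) cuts leakage.
def ioff_reduction(vth_mv: float, ss_old: float, ss_new: float) -> float:
    """Factor by which Ioff drops when SS improves from ss_old to ss_new (mV/dec),
    at fixed Vth and Vdd."""
    return 10 ** (vth_mv / ss_new - vth_mv / ss_old)

# Illustrative: Vth = 250 mV, SS improving from 75 to 70 mV/dec
print(f"{ioff_reduction(250, 75, 70):.2f}x")   # ~1.73x lower Ioff from SS alone
```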