Where are all the transistors going in modern chip designs?

M. Y. Zuo

Active member
I was reminiscing about the early days of multi-core computing, particularly Intel’s first ’Nehalem’ chips back in the fall of 2008.

I couldn’t believe how few transistors those chips had compared to modern designs!

For example,
The Intel i7-940 had 731 million transistors. Geekbench 5 shows a score of 2050. Released fall 2008.
vs.
The Intel i7-12700K has over 9 billion transistors, or so I heard; correct me if otherwise. Geekbench 5 shows a score of 13790. Released winter 2022.

Approximately the same TDP for both (and higher ‘boost’ TDP for the new chip).

For 12x to 13x the number of transistors, and after 13 years of architectural improvements, Intel’s latest and greatest i7 is only about 6.5x faster than their first i7.
I don’t have the latest chip on hand to benchmark, but I tend to trust Geekbench figures within some margin of error, particularly as Geekbench 5 multi-core tests use nearly ideal multi-core workloads.

Cache size is 25 MB for the new chip vs 8 MB for the old chip, so more SRAM explains some of the difference. Likewise with more I/O and hardware features such as the H.265 encoder/decoder, the IME, etc. The integrated GPU on the new chip probably adds another billion-plus transistors too.

A naive multiplication of 731 million by 6.5 yields ~4.7 billion transistors. Tack on an extra 2 billion for all the new stuff and you get 6.7 billion.

In any case that still leaves quite a large portion unaccounted for. Where is Intel spending those extra billions of transistors?
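
A quick back-of-envelope sketch of that arithmetic in Python (using the exact score ratio, which comes out closer to ~6.7x than the rounded 6.5x; the 2 billion allowance for extra features is my assumption from above, not a measured figure):

```python
old_transistors = 731e6            # Core i7-940 (Nehalem), per the post
new_transistors = 9e9              # Core i7-12700K, approximate
speedup = 13790 / 2050             # Geekbench 5 multi-core ratio, ~6.7x

naive = old_transistors * speedup  # if performance scaled 1:1 with transistors
extras = 2e9                       # assumed budget for iGPU, cache, codecs, I/O
unaccounted = new_transistors - (naive + extras)

print(f"speedup:     {speedup:.1f}x")
print(f"naive total: {(naive + extras) / 1e9:.1f} B transistors")
print(f"missing:     {unaccounted / 1e9:.1f} B transistors")
```

Even with that generous allowance, roughly 2 billion transistors remain unexplained.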
 
The newer CPU also handles a lot more instructions than the older one (e.g. no AVX of any kind in the 940). The 12700K also has an iGPU, which the i7-940 didn't have. The newer CPU can also drop into lower power states, which requires more transistors to manage those states.

Beyond that, gains in performance on general-purpose integer code aren't just a matter of doubling transistors for double the performance. It gets harder and harder to extract more IPC over time (it takes more than 2x the transistors to get a 2x gain). In addition, newer CPUs have a lot more security features and logic than older ones.
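
One way to picture that superlinear cost is Pollack's rule, the old rule of thumb that single-core performance grows only as roughly the square root of the transistor budget. A minimal sketch:

```python
import math

# Pollack's rule of thumb: single-core performance ~ sqrt(transistor budget).
# A rough model only, but it shows why 2x transistors != 2x performance.
def pollack_perf(transistor_ratio: float) -> float:
    return math.sqrt(transistor_ratio)

for ratio in (2, 4, 9):
    print(f"{ratio}x transistors -> ~{pollack_perf(ratio):.1f}x performance")
# 2x transistors -> ~1.4x; you need ~4x the transistors for a 2x gain.
```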
 
Thanks. I totally forgot about the difference in instruction sets. That would certainly explain a lot of it.
I didn’t realize that the power management transistors would be a non-trivial amount. What sort of order of magnitude is needed to enable the lower power states? 100 million extra?

And I had always thought design improvements could enable linear scaling, though it would explain the rest of the difference if more IPC requires a superlinear number of transistors.
If we assume all the extra features and functionality require an extra 2.5 billion transistors, then the equivalent of the Nehalem chip in 2022, without AVX, GPU, IME, etc., would be ~6.5 billion transistors?
Roughly 8.9x the original chip for 6.5x the performance.
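
A quick check of that arithmetic (the 2.5 billion figure for extras is the assumption above, not a measured number):

```python
new_total = 9e9                 # i7-12700K, approximate
assumed_extras = 2.5e9          # AVX units, iGPU, IME, etc. -- an assumption
equivalent = new_total - assumed_extras
print(f"Nehalem-equivalent budget: {equivalent / 1e9:.1f} B")   # ~6.5 B
print(f"ratio vs. i7-940:          {equivalent / 731e6:.1f}x")  # ~8.9x
```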
 
There are way better experts to chime in here than me (I'm no expert :) ), but there are several ways you could try to determine an "equivalent" chip today.

Are you looking for similar die size? Then compare Intel 45nm density to "Intel 7" or TSMC N7 and calculate how many transistors the same die area as Nehalem (on Intel 45nm) would hold today (see the sketch at the end of this post). WikiChip has a lot of good info on transistors per mm², but of course the type of transistor matters (i.e. SRAM is denser than logic, though logic has been scaling down better than SRAM lately - thanks Paul2).

Are you looking for equivalent performance, i.e. what would Nehalem look like on a modern node? This is hard to predict, but Intel's Gracemont cores are slightly more stripped down in instructions than the big cores and not focused on clock speed.

Best performance at equivalent thermals? Look at modern CPUs running real workloads to see what they use, as different applications can have wildly different impacts.

P.S. The i7-940 was higher end than the 12700K is today, because the 900 series had 3 channels of memory where the 12700K is dual channel. So in some ways the 12700K needs fewer transistors to do the same job. That said, DDR4/5 support is a lot more transistor-heavy than plain DDR3.
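
To make the die-size comparison concrete, here is a rough sketch using ballpark peak densities of the kind WikiChip lists. These are illustrative only: real chips mix SRAM and logic and never hit peak density, and the ~263 mm² Bloomfield die size is approximate.

```python
# Approximate peak logic densities in million transistors per mm2 (ballpark).
density_mtr_mm2 = {
    "Intel 45nm": 3.3,      # roughly matches Nehalem's 731M in ~263 mm2
    "Intel 7":    100.0,    # Intel quotes ~100 MTr/mm2
    "TSMC N7":    91.0,
}
nehalem_die_mm2 = 263       # quad-core Nehalem (Bloomfield), approximate

for node, density in density_mtr_mm2.items():
    budget = density * 1e6 * nehalem_die_mm2
    print(f"{node:>10}: ~{budget / 1e9:5.1f} B transistors in {nehalem_die_mm2} mm2")
```

On these numbers, a Nehalem-sized die on a 7nm-class node would hold on the order of 25 billion transistors, well above what the 12700K actually spends.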
 
Increasing parallelism on the back end of the machine (to increase IPC) also means increasing the prediction and fetch capabilities of the front end in order to feed the beast, as it were. So some of that allocation goes into prefetch and branch prediction, which in turn deepens the queues, meaning both logic and SRAM transistor increases there.
You mentioned the non-CPU elements, but the interconnect fabric also grows in complexity the more cores you have on the same die (e.g. a bidirectional ring architecture with distributed L3$ versus a switch). "More" I/O means not only more lanes of equivalent PCIe/DRAM, but also novel IP like Thunderbolt that is now on-die, whereas it would have been separate or part of the Southbridge/PCH before.

You see similar architecture trends in the mobile space as well. ARM standard designs now have a significantly more complicated uncore (DynamIQ) than when they only had 4-core modules with a shared L2$, the chassis-level interconnects are also growing in complexity (NoCs vs AXI), and system-level caches are becoming more popular. Apple designs in general throw a tremendous amount of SRAM at coherency between their CPU and GPU/NPU/etc., and even so are still pushing DRAM bandwidth to crazy levels (for a client part) in designs like the M1 Max (4*128b LP5, equivalent in width to 8 channels of DDR4/5, which is where Zen 3 server parts are at).
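
The width equivalence is easy to check; the data rates below are my assumptions (LPDDR5-6400 for the M1 Max, DDR5-4800 for the server comparison):

```python
def bw_gb_s(bus_bits: int, mega_transfers: int) -> float:
    return bus_bits / 8 * mega_transfers / 1000   # GB/s

m1_max_bits = 4 * 128     # 4 x 128-bit LPDDR5  = 512-bit bus
ddr5_bits = 8 * 64        # 8 x 64-bit channels = 512-bit bus, same width
print(bw_gb_s(m1_max_bits, 6400))   # ~409.6 GB/s (LPDDR5-6400)
print(bw_gb_s(ddr5_bits, 4800))     # ~307.2 GB/s (DDR5-4800)
```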

One of the paths to power efficiency in modern designs is specialized accelerators (as you mentioned: ISPs, VDEC, NPUs, etc.) that offload the main CPU and execute tasks in parallel in a more power-efficient way. But when you aren't performing that particular function, those accelerators get clock/power gated and become "dark silicon". So you spend more transistors on things that are only used some of the time, which further increases die area.
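
A toy model of why gated accelerators are cheap in power but expensive in area (all numbers invented for illustration):

```python
# name: (area in mm2, active power in W, gated leakage in W) -- made-up values
blocks = {
    "CPU cores": (60, 35.0, 2.0),
    "GPU":       (50, 25.0, 1.0),
    "VDEC":      (10,  2.0, 0.05),
    "NPU":       (15,  4.0, 0.1),
}
active = {"CPU cores"}   # only the CPU cluster is busy in this scenario

area = sum(a for a, _, _ in blocks.values())
dark = sum(a for name, (a, _, _) in blocks.items() if name not in active)
power = sum(aw if name in active else gw for name, (_, aw, gw) in blocks.items())
print(f"area {area} mm2, dark {dark} mm2 ({dark / area:.0%}), power {power:.2f} W")
```

In this made-up scenario over half the die is dark at any instant, yet it adds only a few percent to total power.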
 
Thanks for the enlightening comment! Do you expect this trend to continue for the foreseeable future? I.e., could Intel’s i7 chip by 2030 be 50-billion-plus transistors, with some very large portion, perhaps more than half, being ’dark silicon’ that remains unused most of the time?

(As a sidenote, I’ve heard that the all-in cost per transistor of TSMC’s N5 is about the same as N7 for Apple. And considering that everyone else is a lower-volume customer, it seems quite likely that TSMC’s N3 will end up costing significantly more per transistor. If that’s the case, would you expect an alternative paradigm to emerge, to economize on SRAM, etc.?)
 
There are also the DDBs and SDBs (double and single diffusion breaks) in FinFET processes; those dummy transistors get counted as well for marketing bragging rights.

Another issue is the clock tree. Large-area chips may have longer clock trees, which require a lot of buffers to fix critical paths for performance sign-off. The fan-out also plays a role in the number of required buffers, and with the reduced standard-cell track counts on new nodes, even more buffers are required to drive critical paths.
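
Rough fan-out arithmetic behind the buffer count: with a maximum fan-out of F, driving N clock sinks takes about log_F(N) buffer levels and roughly (N-1)/(F-1) buffers in an ideal balanced tree. A sketch with made-up numbers:

```python
import math

def clock_tree_estimate(n_sinks: int, max_fanout: int = 4) -> tuple[int, int]:
    levels = math.ceil(math.log(n_sinks, max_fanout))
    buffers = (n_sinks - 1) // (max_fanout - 1)   # ideal balanced tree
    return levels, buffers

for sinks in (1_000, 1_000_000):
    levels, bufs = clock_tree_estimate(sinks)
    print(f"{sinks:>9,} sinks -> {levels} levels, ~{bufs:,} buffers")
```

Real clock trees are worse than this ideal, since skew balancing and long top-level routes add buffers beyond the pure fan-out minimum.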

When you plug all of this into your power budget, almost everything becomes dark silicon.
 
Cost per transistor is still dropping each node, but more slowly than it used to, because increased EUV use is pushing up wafer cost.

However, SRAM is not scaling as fast as logic, so if the chip area is dominated by SRAM this might not hold, and cost per transistor might be flat or even increasing.

The real reason for moving to the next node nowadays is either power saving (though the reduction per node has also come down) or because you want to get more gates/RAM on the chip without the die size increasing, which would make yield drop.

Power density (W/mm²) for active circuits is also going up, because gate density is rising faster than per-gate power is falling. Luckily, because chip complexity is also going up (e.g. adding accelerators), the amount of dark silicon at any instant is also increasing, so total chip power dissipation is pretty much unchanged.
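
For a feel of the cost-per-transistor numbers, a sketch using rough public estimates of wafer cost and peak logic density (illustrative only, not TSMC's actual pricing):

```python
# node: (wafer cost in USD, peak logic density in MTr/mm2) -- rough estimates
nodes = {
    "N7": ( 9_300,  91.0),
    "N5": (17_000, 170.0),
}
usable_wafer_mm2 = 70_000    # ~usable area of a 300 mm wafer

for node, (wafer_cost, density) in nodes.items():
    transistors = density * 1e6 * usable_wafer_mm2
    print(f"{node}: ~${wafer_cost / (transistors / 1e9):.2f} per billion transistors")
```

With these estimates the cost per transistor is only slightly lower on N5 than N7, consistent with the point above, and the picture worsens once SRAM-heavy dies (which don't reach peak logic density) are factored in.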
 