
Alternatives to Shrink for Performance?

Arthur Hanson

Well-known member
Now that shrink is coming to an end, what are the best alternatives for improving performance? Layering, optical, faster speeds, different architectures, breaking inputs up for parallel processing? What companies will lead?
 
As long as there's value added, we can use everything. For example, in HBM (memory) manufacturing, stacking is used heavily even though the price per bit goes up. They can do that because AI companies are willing to pay A LOT.

On the logic side, I think there will be an 'in-house design' era where large IT companies decide transistor usage on their own. If transistors are expensive (or, more precisely, if price reduction slows down), the first thing to do is economize on transistor usage. The TSMC-driven IP ecosystem, combined with the large in-house demand from big IT companies, makes this possible.
On the manufacturing side, there is a lot of pathfinding going on, but none of it looks like a game changer yet. They'll keep pushing 3D geometry (GAA, forksheet, atomic channels, etc.) for a while, but that will run out in the foreseeable future anyway.

In general, if technology migration stops, then we're going back to the 1970s, when everyone used BJTs to design their own CPUs and memories...
 
Personally, I would argue that vertical integration will be the next major avenue for increasing transistor density over the coming decades. This approach nonetheless has enormous difficulties that will require a complete change in manufacturing processes and techniques, taking at least a decade for the industry to solidify the solutions. It will definitely not rekindle Moore's Law, but it will allow transistor density to keep improving.

This transition to vertical integration is already clearly visible: we are seeing a basic implementation of it in the form of SRAM stacked on logic, and in the planned introduction of backside power delivery on the Intel 1.4nm and TSMC 2nm nodes. This will subsequently be followed by the stacking of NFET and PFET transistors on top of each other (CFETs), scheduled to enter production from the late 2020s to the early 2030s. Lastly, I also predict that transistors involved in power management and clock syntonisation will move into the back end of line (BEOL) to offload some functionality from the expensive high-density logic layers.

With respect to optical interconnects and non-volatile memory, I believe they will also have a place in this new paradigm. Chiplet-based designs connected via integrated optical interconnects should allow a significant reduction in energy consumption while theoretically offering increased inter-chiplet bandwidth. Emerging memories such as ReRAM, MRAM, or PRAM will eventually form the basis for near-memory computation, reducing the detrimental effects of the von Neumann bottleneck.

I don’t expect any radical departure from the current leaders within the semiconductor industry over the next decade, since the evolution of these technologies is a mutual, industry-led joint effort that depends on moderate to high levels of government support and involvement. Beyond ten years, the most significant development will be the emergence of China-based companies, which could pose a direct threat in the deposition and etching equipment market; five to ten years after that, viable Chinese competitors in lithography and computational lithography could result in substantial industry shakeups.
 
The problem with stacking for logic (unlike memory) is power density and heat extraction; the power density (W/mm2) for a single layer of logic is already going up node by node because density is going up faster than power per gate is shrinking, so it's getting more and more difficult to cool devices, especially big high-activity ones like GPUs. GAA FETs are worse than FinFETs for this, and CFETs even more so. And as soon as you stack multiple chips getting the heat out gets much harder still, both because thermal resistance goes up (especially for the middle chips if there are more than 2 layers) and so does power density (cooling problem).

Until/unless this heat/cooling problem is solved stacking for (high-power-density) logic chips isn't going to give us a way out... :-(

And yes there has been work done on liquid cooling through microchannels, but for these to work the logic layers in the stack need to be widely separated vertically -- which not only means interconnect problems, but goes directly against the trend to use small TSVs and direct bonding of thinned wafers to get increased interconnect density and speed between the layers...
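
Just to put rough numbers on why the stack makes the cooling problem worse, here's a toy one-dimensional thermal-resistance sketch. Every value in it (resistances, power, ambient) is an assumption for illustration, not measured data:

```python
# Toy 1-D thermal model for a two-die logic stack. All numbers are assumptions.
power_per_die_w = 100.0          # ~1 W/mm2 over a 100 mm2 die, roughly today's level

r_heatsink = 0.10                # K/W, lid + heatsink to ambient (assumed)
r_die      = 0.02                # K/W through one thinned die (assumed)
r_bond     = 0.05                # K/W through a bond/TSV layer (assumed)

t_ambient = 40.0                 # degC

# Single die: its heat crosses the die and the heatsink.
t_single = t_ambient + power_per_die_w * (r_die + r_heatsink)

# Two stacked dies, heatsink on top: both dies' heat crosses the heatsink,
# and the bottom die's heat must additionally cross the bond layer and the
# die above it before it gets there (a crude approximation of the real flow).
p_total  = 2 * power_per_die_w
t_sink   = t_ambient + p_total * r_heatsink
t_top    = t_sink + power_per_die_w * r_die
t_bottom = t_top + power_per_die_w * (r_bond + r_die)

print(f"single die junction : {t_single:.0f} C")
print(f"stack, top die      : {t_top:.0f} C")
print(f"stack, bottom die   : {t_bottom:.0f} C")
```

Even with these friendly assumed numbers, the buried die runs noticeably hotter at the same per-die power, and it only gets worse with more layers.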
 
I personally don’t anticipate commercially viable high-density logic-on-logic stacking for another fifteen to twenty years, based heavily on the reasons you have specified.

The vertical integration I referenced in my prior comment is more about changing the transistor structure and the interconnects to a more vertically oriented structure and design, along with the stacking of memory on logic for near-memory computation.

I nonetheless believe that eventually the industry will come up with effective methods to stack logic on logic without excessive thermal resistance, be it through more exotic transistor designs such as CNTFETs and TFETs, or through an entirely new technology paradigm. One thing is clear to me: two decades' worth of research in a trillion-dollar industry can result in some incredible innovations.
 

Even if the problem of thermal resistance through the stacked chips can be magically solved, that still leaves the power density problem, which is still going up node by node because density is growing faster than power per gate is decreasing, and there are no signs of this getting any better -- in fact it's likely to get worse. New technologies like GAA and CFET tend to increase density more than they decrease power, which makes the problem even worse -- and they're a one-off change, the scaling problem then resumes after they've been introduced. We're over 1W/mm2 today; with stacked logic in a few years we'll hit 10W/mm2... :-(

There have indeed been lots of great technology advances, but we need a fundamental and massive decrease in power consumption per gate to make logic stacking feasible -- at least in "normal" applications which can't resort to exotic and expensive cooling solutions like microfluidics and diamond heatsinks, great for cost-no-object applications (IBM mainframes?) but not for 99.9% of the world... :-(

Since in the end power comes down to CV^2*F and F is what we need to at least keep the same (or even improve), and reducing C significantly is difficult and slowing down per node, lower voltage is the only way to go -- but this needs transistors (or switching devices) with much steeper subthreshold slope and better Vth control (to reduce off leakage) and much better gm/I and speed with low (Vgs-Vth). No amount of tweaking/shrinking CMOS FETs with current materials is going to deliver this, a fundamental change in device is needed.
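
To put rough numbers on that square-law, here is a quick sketch; the capacitance, activity factor and frequency are assumptions, not process data:

```python
# Dynamic switching power: P = alpha * C * Vdd^2 * f
# Frequency is held constant here, which is exactly the constraint described
# above -- in practice lower Vdd also costs speed unless the devices improve.
def dynamic_power(c_switched_f, vdd_v, freq_hz, activity=0.15):
    return activity * c_switched_f * vdd_v**2 * freq_hz

C_SWITCHED = 1e-9    # total switched capacitance in farads (assumed)
FREQ       = 3e9     # 3 GHz (assumed)
p_ref = dynamic_power(C_SWITCHED, 0.90, FREQ)

for vdd in (0.90, 0.75, 0.65, 0.50):
    p = dynamic_power(C_SWITCHED, vdd, FREQ)
    print(f"Vdd = {vdd:.2f} V : {p:.2f} W  ({100 * p / p_ref:.0f}% of the 0.90 V power)")
```

Dropping from 0.90 V to 0.50 V cuts dynamic power to roughly 31% at the same frequency -- which is exactly why voltage is the lever that matters, and why the device limitations above are the blocker.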
 
Even if the problem of thermal resistance through the stacked chips can be magically solved, that still leaves the power density problem, which is still going up node by node because density is growing faster than power per gate is decreasing, and there are no signs of this getting any better -- in fact it's likely to get worse. New technologies like GAA and CFET tend to increase density more than they decrease power, which makes the problem even worse -- and they're a one-off change, the scaling problem then resumes after they've been introduced. We're over 1W/mm2 today; with stacked logic in a few years we'll hit 10W/mm2... :-(
Yes, and AI computation is already at the thermal limit with close to ideal layouts. It can gain a bit from functional diversity with denser logic (separate pipelines for INT8, FP8, BF16, etc.), but as more of the modelling is simply learning to use the fastest units -- FP4/INT8, which are essentially the same computation after unpacking -- the diversity will be of marginal benefit. It is a bit like what happened to SRAM, where we reached the ideal layout and elements at the 7nm node and have barely shifted since. CV^2 is now doing that to logic. Liquid cooling might get us to a few hundred W per cm2, though the junction temps are likely to be scorching.
Since in the end power comes down to CV^2*F and F is what we need to at least keep the same (or even improve), and reducing C significantly is difficult and slowing down per node, lower voltage is the only way to go -- but this needs transistors (or switching devices) with much steeper subthreshold slope and better Vth control (to reduce off leakage) and much better gm/I and speed with low (Vgs-Vth). No amount of tweaking/shrinking CMOS FETs with current materials is going to deliver this, a fundamental change in device is needed.
C is projected to decrease as much as 10% by the time we get to CFET (4 years?). Vertical GAA might do better, unclear. V depends on temperature. We probably cannot run even leaky AI computation below about 0.35V at > 350K junction temps.

I have a blog post from a couple of weeks ago about this, with some references cited. DM me if you want the link. Not sure the rules here allow me to embed it.
 

That's what I meant when I said lower C isn't going to save us, it's not dropping very rapidly -- even lower-K dielectrics are unlikely, wires are getting more densely packed, gate capacitance is dropping a bit as smaller transistors give more drive but again this is low -- real channel length has been around 17nm ever since N7, N3 is almost the same, N2 with GAA might reduce it by a couple of nm (not sure). The biggest gains with each node now are density and these are mainly from DTCO not raw pitches, power savings are gradually decreasing with each node.

Operating voltage is the only thing left which can drastically reduce power, especially because of the square-law, but you give away speed in return for this. If power is the #1 priority and area/chip cost is of less concern, the optimum operating voltage is already less than 0.5V, you can get big power savings this way but the reduced speed means maybe halving the clock rate and doubling the silicon area (double the parallelism), and few applications can afford to do this.
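
A quick toy comparison of that "half the clock, double the parallelism" trade (all values assumed for illustration only):

```python
# One block at nominal Vdd vs. two parallel copies at low Vdd and half the
# clock: same work per second, roughly 2x the area, much less dynamic power.
# Leakage of the duplicated logic is ignored here. All numbers are assumed.
vdd_nom, f_nom = 0.75, 3.0e9     # nominal design point (assumed)
vdd_low, f_low = 0.50, 1.5e9     # low-voltage point, ~half speed (assumed)
c_eff = 1.0                      # switched capacitance per block, arbitrary units

p_fast     = c_eff * vdd_nom**2 * f_nom        # one block at full speed
p_parallel = 2 * c_eff * vdd_low**2 * f_low    # two blocks at half clock

print(f"throughput ratio : {2 * f_low / f_nom:.2f}x")      # 1.00x -> same throughput
print(f"dynamic power    : {p_parallel / p_fast:.2f}x")    # ~0.44x
```

Under these assumed numbers you keep the throughput and cut dynamic power nearly in half, but you pay roughly double the silicon area (and leakage) to do it -- which is the cost most products can't swallow.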

You can also do adaptive Vdd to maintain the same speed over process corners and temperature -- lower Vdd for fast hot chips, higher Vdd for slow cold chips -- and this also greatly reduces the chip-to-chip power variation (especially worst-case), but again many systems can't afford the complexity (and power supplies) to do this.
 
That's what I meant when I said lower C isn't going to save us, it's not dropping very rapidly -- even lower-K dielectrics are unlikely, wires are getting more densely packed, gate capacitance is dropping a bit as smaller transistors give more drive but again this is low -- real channel length has been around 17nm ever since N7, N3 is almost the same, N2 with GAA might reduce it by a couple of nm (not sure). The biggest gains with each node now are density and these are mainly from DTCO not raw pitches, power savings are gradually decreasing with each node.
K change makes no difference, since gate capacitance is set by need to control the field in the gate. The point of high-K was to increase thickness to avoid tunneling while applying the same field to the gate.

GAA shortens the gate, which is nice for CPP reduction, but it adds a 4th side so C is pretty much the same. Which goes back to controlling the gate. I'm not sure if 2D material ribbons would be an improvement because controlling the gate is easier with a weaker capacitance? Have not seen it discussed in the research - processes are judged on density and they do care about STS for voltage, but rarely is Ceff discussed and (unlike size) it seems never quantified, treated as secret by the vendors. Which is not good for progress - what does not get measured (competitively) does not get improved.
Operating voltage is the only thing left which can drastically reduce power, especially because of the square-law, but you give away speed in return for this. If power is the #1 priority and area/chip cost is of less concern, the optimum operating voltage is already less than 0.5V, you can get big power savings this way but the reduced speed means maybe halving the clock rate and doubling the silicon area (double the parallelism), and few applications can afford to do this.
You don't give away speed if (a) you are prepared to leak, which you may be in dense arithmetic with high gate utilization per clock and (b) your clock is heat limited if you do not reduce V.
You can also do adaptive Vdd to maintain the same speed over process corners and temperature -- lower Vdd for fast hot chips, higher Vdd for slow cold chips -- and this also greatly reduces the chip-to-chip power variation (especially worst-case), but again many systems can't afford the complexity (and power supplies) to do this.
Yes. Vdd domains for different functional units. Higher V for low gate utilization, lower V for intense gate utilization. Even mobile chips are going to need this, so they will pay for the solution. We already have multiple Vdd domains in server and GPU chips, I have been told. The voltage needs of SRAM are an interesting problem, too.
 
Now that shrink is coming to an end, what are the best alternatives for improving performance? Layering, optical, faster speeds, different architectures, breaking inputs up for parallel processing? What companies will lead?

It is interesting to follow the Apple SoCs. The A16 Bionic uses TSMC N4P and has 16B transistors on a 112.75 mm2 die. The A17 Pro uses TSMC N3B and has 19B transistors on a 103.80 mm2 die. Yes there are a lot of factors to consider such as architecture and design but this shows that we are in fact shrinking.
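
Dividing out those figures (the die sizes and transistor counts are the ones quoted above; note this is a crude whole-die average across logic, SRAM and analog, not a logic-only density):

```python
# Whole-die transistor density from the figures quoted above.
a16_transistors, a16_area_mm2 = 16e9, 112.75   # A16 Bionic, TSMC N4P
a17_transistors, a17_area_mm2 = 19e9, 103.80   # A17 Pro,    TSMC N3B

d16 = a16_transistors / a16_area_mm2 / 1e6     # million transistors per mm2
d17 = a17_transistors / a17_area_mm2 / 1e6

print(f"A16 Bionic: {d16:.0f} Mtr/mm2")
print(f"A17 Pro   : {d17:.0f} Mtr/mm2  (~{100 * (d17 / d16 - 1):.0f}% denser)")
```

That works out to roughly a 29% density improvement in one node step, which is exactly the point: shrink is slowing, but it hasn't stopped.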

It will be interesting to see what the first TSMC N2 Apple SoC looks like. Then we have CFETs to look forward to. Exciting times in the semiconductor industry, absolutely.
 
K change makes no difference, since gate capacitance is set by need to control the field in the gate. The point of high-K was to increase thickness to avoid tunneling while applying the same field to the gate.

GAA shortens the gate, which is nice for CPP reduction, but it adds a 4th side so C is pretty much the same. Which goes back to controlling the gate. I'm not sure if 2D material ribbons would be an improvement because controlling the gate is easier with a weaker capacitance? Have not seen it discussed in the research - processes are judged on density and they do care about STS for voltage, but rarely is Ceff discussed and (unlike size) it seems never quantified, treated as secret by the vendors. Which is not good for progress - what does not get measured (competitively) does not get improved.

You don't give away speed if (a) you are prepared to leak, which you may be in dense arithmetic with high gate utilization per clock and (b) your clock is heat limited if you do not reduce V.

Yes. Vdd domains for different functional units. Higher V for low gate utilization, lower V for intense gate utilization. Even mobile chips are going to need this, so they will pay for the solution. We already have multiple Vdd domains in server and GPU chips, I have been told. The voltage needs of SRAM are an interesting problem, too.
I meant lower-than-ELK dielectrics in the metal stack, since wiring is now where a lot of the capacitive load (and therefore power consumption) is.

For high-speed circuits which are on all the time you use fast but leaky ELVT transistors, but many applications have blocks which are not always used in all operating modes -- you can stop the clock to eliminate dynamic power but the leakage remains, especially if you've used ELVT transistors. On-chip power-rail switching sounds great until you try and do it, and increases power when the circuit is on due to voltage drop. Optimum mix of library cells (Vth and size) depends on activity/clock rate, we tend to use at least 3 or 4 different Vth in each block to optimise power/speed -- for example ULVT as "standard", swap in ELVT for fastest critical paths but keep numbers down because they leak like crazy, swap in ULVT_LL or LVT for non-critical paths -- and the mix will differ between blocks even with a common Vdd.

We use at least 3 different logic supplies for different purposes; low-voltage core for power-critical logic, separate higher-voltage supply for RAM and faster logic, even higher voltage for ultrafast logic in things like SERDES -- and some or all of these use adaptive voltage (on a per-chip basis) by controlling the SMPS so the circuits are "just fast enough". A lot of this is only possible with separate (and digitally trimmable) SMPS for each chip, which is OK in some cases but not for others where multiple chips share a supply rail -- plus you need a controller to monitor circuit performance and adjust the supply voltages dynamically.
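
As a very rough sketch of the closed-loop part of that (not any particular vendor's scheme -- the monitor, targets and step size below are all assumptions):

```python
# Minimal adaptive-Vdd control step (all interfaces and numbers hypothetical):
# an on-chip delay monitor (e.g. a ring oscillator built from critical-path
# cells) is compared against a target, and a trimmable SMPS is nudged up or
# down so each individual chip runs "just fast enough" over corner and temp.
TARGET_MHZ = 500.0               # monitor frequency needed for timing closure (assumed)
MARGIN_MHZ = 10.0                # guard band before trimming down (assumed)
VDD_MIN, VDD_MAX = 0.55, 0.85    # allowed supply window in volts (assumed)
STEP_V = 0.005                   # SMPS trim step (assumed)

def adjust_vdd(vdd_v, monitor_mhz):
    """One control step: raise Vdd if the silicon is too slow, trim it down
    if there is comfortable margin, and clamp to the allowed window."""
    if monitor_mhz < TARGET_MHZ:
        vdd_v += STEP_V
    elif monitor_mhz > TARGET_MHZ + MARGIN_MHZ:
        vdd_v -= STEP_V
    return min(max(vdd_v, VDD_MIN), VDD_MAX)

# A fast, hot part reading 530 MHz at 0.750 V gets trimmed down a step:
print(adjust_vdd(0.750, 530.0))   # -> 0.745
```

The hard part isn't the loop itself, it's everything around it: per-rail trimmable supplies, monitors that actually track the critical paths, and making sure no shared-rail chip gets starved.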
 
That's what I meant when I said lower C isn't going to save us, it's not dropping very rapidly -- even lower-K dielectrics are unlikely, wires are getting more densely packed, gate capacitance is dropping a bit as smaller transistors give more drive but again this is low -- real channel length has been around 17nm ever since N7, N3 is almost the same, N2 with GAA might reduce it by a couple of nm (not sure). The biggest gains with each node now are density and these are mainly from DTCO not raw pitches, power savings are gradually decreasing with each node.

Operating voltage is the only thing left which can drastically reduce power, especially because of the square-law, but you give away speed in return for this. If power is the #1 priority and area/chip cost is of less concern, the optimum operating voltage is already less than 0.5V, you can get big power savings this way but the reduced speed means maybe halving the clock rate and doubling the silicon area (double the parallelism), and few applications can afford to do this.

You can also do adaptive Vdd to maintain the same speed over process corners and temperature -- lower Vdd for fast hot chips, higher Vdd for slow cold chips -- and this also greatly reduces the chip-to-chip power variation (especially worst-case), but again many systems can't afford the complexity (and power supplies) to do this.
I agree with your assessment. The power and heat issue will definitely become the core concern for further scaling. I still firmly believe the industry will eventually adopt measures to address these issues -- for instance, migrating to more exotic transistor designs such as tunnelling field-effect transistors in order to achieve a sub-60 mV/decade subthreshold swing.
 
For instance, migrating to more exotic transistor designs such as tunnelling field-effect transistors in order to achieve a sub-60 mV/decade subthreshold swing.
Really challenging to try tricks like negative capacitance ferroelectrics or tunnelling combined with the complex fabrication of ribbon or CFET. These complex shrinks reward conservative devices. Whether the construction is ever mastered well enough to start adding gate trickery - maybe, but years and years from now.
 
I meant lower-than-ELK dielectrics in the metal stack, since wiring is now where a lot of the capacitive load (and therefore power consumption) is.
Already in use. More would be nice, but if it were easy they would already have done it. Shrinking the dimensions does not make such materials easier.
For high-speed circuits which are on all the time you use fast but leaky ELVT transistors, but many applications have blocks which are not always used in all operating modes -- you can stop the clock to eliminate dynamic power but the leakage remains, especially if you've used ELVT transistors. On-chip power-rail switching sounds great until you try and do it, and increases power when the circuit is on due to voltage drop. Optimum mix of library cells (Vth and size) depends on activity/clock rate, we tend to use at least 3 or 4 different Vth in each block to optimise power/speed -- for example ULVT as "standard", swap in ELVT for fastest critical paths but keep numbers down because they leak like crazy, swap in ULVT_LL or LVT for non-critical paths -- and the mix will differ between blocks even with a common Vdd.
I wonder if such diverse-circuit tactics will help much for AI chips where mostly they are grinding out low precision tensors and moving data.
We use at least 3 different logic supplies for different purposes; low-voltage core for power-critical logic, separate higher-voltage supply for RAM and faster logic, even higher voltage for ultrafast logic in things like SERDES -- and some or all of these use adaptive voltage (on a per-chip basis) by controlling the SMPS so the circuits are "just fast enough". A lot of this is only possible with separate (and digitally trimmable) SMPS for each chip, which is OK in some cases but not for others where multiple chips share a supply rail -- plus you need a controller to monitor circuit performance and adjust the supply voltages dynamically.
Sounds like fun! I assume BSPD will add another level of design opportunity/complexity here.
 