Memory versus Logic improvements in new process nodes

Chris9594 · Oct 15, 2021

I am doing research into the relative improvements of logic and memory portions of semiconductor design.

To date, I have an understanding that memory aspects of designs have not improved as fast as logic portions of designs.

Am looking for direction into articles or insights that could speak to this understanding / misunderstanding.

Thx!

mgoldsmith1979 · Oct 15, 2021

Chris9594 said:
I am doing research into the relative improvements of logic and memory portions of semiconductor design.

To date, I have an understanding that memory aspects of designs have not improved as fast as logic portions of designs.

Am looking for direction into articles or insights that could speak to this understanding / misunderstanding.

Thx!

Scotten's data is a good reference, and other sites like WikiChip may also have some details, but the easiest route is to Google the last 3-4 years of TSMC OIP / TS presentations and look at the scaling numbers.

The answer is fairly straightforward: an SRAM bitcell requires 6 transistors to function ("6T"), and for FinFET architectures, the smallest possible bitcell is one where each of those 6 transistors is a single fin in size (termed HDC111). Because they are mixed P/N type you get some variation in the spacing between fins, but typically these are pushed to the minimum (average) fin pitch (FP), which in the case of, say, N7 is 30nm and S7 is 27nm. Contacted Poly Pitch (CPP) scaling is the other dimension, and again, the SRAM bitcell tends to use the minimum CPP available, even when a process offers more than one (e.g., N7 offers 57nm and 64nm for logic, but only 57nm for SRAM). So since the minimum bitcell is already fixed at one fin per transistor (in the FinFET world), it scales only geometrically with the FP and CPP (and metal pitches, but those typically are not limiting).

Logic, on the other hand, when we talk about the height of a standard cell row, has been scaling more aggressively through fin depopulation. By that I mean, on N16 the "HD" library had 3P and 3N fins, with some space between those diffusion regions, while on N7 the "HD" library is 2P2N - so not only do you get the geometric scaling of FP and CPP, but also an additional density improvement from reducing how many fins are actually available in the library. That has meant that the logic scaling and density improvement for N16 --> N7 (as an example) is greater than the SRAM scaling between the same nodes.

However, if you look at pure density (transistors / um^2), the SRAM bitcell is still denser than the logic library can be. Mathematically, the HDC111 bitcell is ~8 FP tall and 2 CPP wide, and has 6 transistors. The smallest logic library cell is an inverter, which would be 2 transistors in 3 CPP * 8 FP (for the aforementioned HD library in N7) - this can be brought to 2 CPP with CPODE, which is expected as part of N6. At the other end of the spectrum, a single-bit flip-flop has ~32 transistors in ~20 CPP (which could be brought to 19 with CPODE); that's more of a made-up value for N7 though, as all the FFs are double-row sized at 13 CPP, so more like 32 xtor in 26 CPP. Regardless of how you cut it, you can't get a logic density value that is higher than the SRAM.
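
To make that arithmetic concrete, here is a rough sketch that turns the footprints quoted above into transistors per um^2. It assumes the N7 pitches mentioned earlier in this post (30nm FP, 57nm CPP) and treats every cell as a simple rectangle of CPP x FP; the function name and the dictionary of cases are just illustrative, and real layouts have boundary effects this ignores.

```python
# Pitches quoted in the post for N7, in nm.
FIN_PITCH_NM = 30
CPP_NM = 57
ROW_HEIGHT_FP = 8  # bitcell / HD row height in fin pitches, per the post


def density_per_um2(transistors: int, width_cpp: int,
                    height_fp: int = ROW_HEIGHT_FP) -> float:
    """Transistors per square micron for a rectangular footprint of CPP x FP."""
    area_nm2 = (width_cpp * CPP_NM) * (height_fp * FIN_PITCH_NM)
    return transistors / (area_nm2 / 1e6)  # 1 um^2 = 1e6 nm^2


cases = {
    "HDC111 SRAM bitcell (6T, 2 CPP)": density_per_um2(6, 2),
    "Inverter (2T, 3 CPP)": density_per_um2(2, 3),
    "Inverter with CPODE (2T, 2 CPP)": density_per_um2(2, 2),
    "Flip-flop (32T, 20 CPP)": density_per_um2(32, 20),
    "Flip-flop, double row (32T, 26 CPP)": density_per_um2(32, 26),
}

for name, value in cases.items():
    print(f"{name}: {value:.0f} transistors/um^2")
```

Under these assumptions the bitcell footprint works out to 240nm x 114nm, and it comes out ahead of every logic case listed, which is the point being made above.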

There was a good presentation from Michael Wu at IEDM in 2019 where he laid out the relative node-to-node scaling of bitcell vs logic density in a normalized way, and his projection at the time was that logic density would catch up to SRAM in ~2022, so we might assume that the N3 HD library will be as dense in terms of xtor/um^2 as the HDC111 SRAM bitcell.

Paul2 · Oct 16, 2021

Samsung always has a joker in its sleeve.

I believe they are by far the most advanced company when it comes to monolithic 3D in memory. Their 3D NAND R&D yielded V-NAND. Their SRAM and RAM efforts were started at the same time, more than a decade ago. Imagine where they are now.

Fred Chen · Oct 23, 2021

Chris9594 said:
I am doing research into the relative improvements of logic and memory portions of semiconductor design.

To date, I have an understanding that memory aspects of designs have not improved as fast as logic portions of designs.

Am looking for direction into articles or insights that could speak to this understanding / misunderstanding.

Thx!

Memory by itself is a commodity function, not much to design. To keep bit cost down, its main focus is improving density.

If you are talking about the embedded memory on an SoC, there are ongoing developments where Flash and DRAM are being replaced by new memory technologies such as MRAM and RRAM, at foundries such as TSMC, UMC, GlobalFoundries, Samsung, etc. But those embedded memories on an SoC are much lower density than the standalone discrete Flash or DRAM dies.

Chris9594 · Nov 12, 2021

Thanks Fred - yes - I was inquiring on embedded memory and noted the new versions.

It's been a minute, but

https://www.servethehome.com/amd-milan-x-scaling-to-0-75gb-of-l3-cache-per-chip/

More than L3 - I am digging into the efficiency / effectiveness of L1 - particularly in the area of hit rates and other bottlenecks.

VCT · Nov 12, 2021

How important is EUV for DRAM and NAND flash in the next 3-5 years?
Will lack of EUV for Chinese DRAM and NAND flash fabs hurt them badly or very little impact?

Fred Chen · Nov 12, 2021

VCT said:
How important is EUV for DRAM and NAND flash in the next 3-5 years?
Will lack of EUV for Chinese DRAM and NAND flash fabs hurt them badly or very little impact?

NAND never really used EUV, since its patterning is basically lots of straight lines, and it has now gone to 3D.

Although some EUV use has started at some DRAM companies, how far will it take them? Probably not far, since there are other issues limiting the scaling. Essentially, shrinking the buried word line pitch makes the cells severely sensitive to Rowhammer.

For DRAM, the capacitors are arranged as staggered pairs, which is actually hard to achieve at current dimensions with a single exposure (too bean-shaped to allow a reasonable active area outline).

ChrisGar · Nov 13, 2021

Chris9594 said:
I am doing research into the relative improvements of logic and memory portions of semiconductor design.

To date, I have an understanding that memory aspects of designs have not improved as fast as logic portions of designs.

Am looking for direction into articles or insights that could speak to this understanding / misunderstanding.

Thx!

I think stacked die SRAMs will be an important development. AMD has publicly talked about it the most.

I'm assuming you meant SRAMs (e.g., embedded on processors) when you say memories. (and not DRAM/flash)

mgoldsmith1979 · Nov 15, 2021

Chris9594 said:
More than L3 - I am digging into the efficiency / effectiveness of L1 - particularly in the area of hit rates and other bottlenecks.

Architectural impacts are tied to these scaling differences. When you look at the "capacity" of I/D$ in terms of KiB or MiB, the difference for L1 vs L2 vs L3 has to do with latency and the corresponding speed of the memory instances. You want single-cycle access time for your L1$, which (tends to) lead to smaller row/column sizes and larger bitcells (122 vs 111) to give you faster access times. The array overhead then also becomes substantial (overhead meaning the row/column sense amps, etc., versus the actual array of bitcells), such that your I/D$ data arrays are close to 4:1 or 2:1 bitcell vs periphery area. If you can relax the timing (higher latency / more clock cycles), then you can increase the string length of the array, opt for slower bitcells, and reduce the overhead on a given instance, increasing the density (bits / mm^2) of the memory instance but sacrificing performance. Large block arrays for L3$ are ~32 KiB per instance in current ARM CPU designs, versus L1 arrays that might be ~4 KiB each. Similar trends apply to the Tag RAM, which needs to grow along with the capacity of the caches, and to other parts of the core, where you may instantiate a register file instead of SRAM in order to get the highest performance at the expense of area. Real physical SRAM instances can also have additional bits for ECC, redundant rows/columns for yield recovery, etc., which also contribute to the size overhead.
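
As a rough illustration of how periphery and extra bits eat into usable density, the sketch below converts a bitcell:periphery area ratio into an array efficiency and then into usable Mbit/mm^2. Several inputs are assumptions and not from this thread: the 0.02736 um^2 bitcell area is simply 8 FP x 2 CPP at the N7 pitches quoted earlier, the same bitcell is used for both ratios (a real 122 bitcell would be larger), and the 8 check bits per 64 data bits is only a hypothetical ECC scheme.

```python
def array_efficiency(bitcell_to_periphery: float) -> float:
    """Fraction of instance area that is bitcells, given a bitcell:periphery ratio."""
    return bitcell_to_periphery / (bitcell_to_periphery + 1.0)


def usable_mbit_per_mm2(bitcell_area_um2: float,
                        bitcell_to_periphery: float,
                        data_bits: int = 64,
                        extra_bits: int = 0) -> float:
    """Usable data bits per mm^2 (in Mbit) after periphery and extra-bit overhead."""
    raw_bits_per_mm2 = 1e6 / bitcell_area_um2          # 1 mm^2 = 1e6 um^2
    data_fraction = data_bits / (data_bits + extra_bits)
    return raw_bits_per_mm2 * array_efficiency(bitcell_to_periphery) * data_fraction / 1e6


ASSUMED_BITCELL_UM2 = 0.02736  # assumption: 240nm x 114nm footprint

for ratio in (4.0, 2.0):
    plain = usable_mbit_per_mm2(ASSUMED_BITCELL_UM2, ratio)
    with_ecc = usable_mbit_per_mm2(ASSUMED_BITCELL_UM2, ratio, extra_bits=8)
    print(f"{ratio:.0f}:1 -> efficiency {array_efficiency(ratio):.2f}, "
          f"{plain:.1f} Mbit/mm^2, {with_ecc:.1f} Mbit/mm^2 with hypothetical ECC")
```

A 4:1 ratio means 80% of the instance is bitcells and 2:1 means about 67%, so the same bitcell yields noticeably fewer usable bits per mm^2 in a small, fast array.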

On hit rates, this depends a lot on the core architecture and what it is doing in support of SMT, prefetchers, etc. You may not benefit from larger L1$ sizes if you are supporting SMT2/4, because a context switch between threads on the same core may invalidate parts of the cache. Conversely, a larger L2 / victim L3$ may benefit SMT because you can evict those invalidated parts to the next level of the hierarchy for when you switch context back, as opposed to doing a full writeback to DRAM and having to fetch it again from there. In multi-threaded / multi-core designs, you also have cache coherency to deal with, and if you attempt to deal with that at the L1$ level, then you have this exponential problem of probe / snoop requests scaling with the # of cores. You can simplify that with a directory to track which core has which cache line, and manage this through a common cache hierarchy level, which in the case of AMD is the L3 (each CPU's L2 is private). Either use the directory to send a probe request to the specific core that is holding the dirty cache line in L1 (which may impact the execution of that core), or have all dirty lines write through to the L3 and have other cores update their (now) invalid cache lines from L3.
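
To show why a directory cuts probe traffic, here is a toy sketch of the idea. It is not any vendor's actual protocol: the `Directory` class, its method names, and the 8-core example are all hypothetical, and it only tracks which cores hold a line, with no states, timing, or L2/L3 modeling.

```python
from collections import defaultdict


class Directory:
    """Toy directory at the shared cache level: line address -> set of holding cores."""

    def __init__(self) -> None:
        self.sharers = defaultdict(set)

    def record_fill(self, core: int, line: int) -> None:
        """Note that `core` now holds a copy of `line`."""
        self.sharers[line].add(core)

    def probes_for_write(self, core: int, line: int) -> set:
        """Cores that must be probed when `core` writes `line`; writer becomes sole holder."""
        targets = self.sharers[line] - {core}
        self.sharers[line] = {core}
        return targets


def broadcast_probes(n_cores: int) -> int:
    """Without a directory, a write probes every other core."""
    return n_cores - 1


N_CORES = 8
directory = Directory()
directory.record_fill(core=2, line=0x40)
directory.record_fill(core=5, line=0x40)

targets = directory.probes_for_write(core=0, line=0x40)
print(f"directory probes: {sorted(targets)} ({len(targets)} requests)")
print(f"broadcast probes: {broadcast_probes(N_CORES)} requests")
```

In this example the directory sends two targeted probes where a broadcast scheme would send seven, and the gap widens as the core count grows.
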
There is always some point where there are diminishing returns on increasing the cache size at each hierarchical level, but finding that point has to do with the machine width and the types of workloads executing. Apple's CPUs, for example, have very large L1$ relative to industry peers because the amount of parallelism in their machine back-end means they need larger caches to 'feed the beast'. If you took something like an A76 and just doubled the L1$ size (without increasing the parallelism of the execution), you would not see a substantial uplift in performance, because the hit rate is already so high that the extra cache space would get filled with stale data, or instructions for branches that are never taken. And once you increase the address range by another bit, you might as well try to maximize that, which leads to power-of-2 doubling in most cases (but not all, hence 32 --> 48 KiB, 128 --> 192 KiB, etc. sized caches). Core architects ultimately have to determine where the sweet spot is in these implementations (typically through a combination of modeling and workload experimentation).
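
The diminishing-returns point can be seen in a very simplified simulation. The sketch below assumes a fully associative LRU cache and a synthetic trace that touches a 48-line working set uniformly at random; real L1 caches are set-associative and real workloads look nothing like this, so treat it purely as an illustration of the shape of the curve.

```python
import random
from collections import OrderedDict


def lru_hit_rate(trace, capacity_lines: int) -> float:
    """Hit rate of a fully associative LRU cache over a trace of line addresses."""
    cache = OrderedDict()
    hits = 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)


rng = random.Random(0)                     # fixed seed so runs are repeatable
WORKING_SET_LINES = 48                     # assumption: synthetic working set
trace = [rng.randrange(WORKING_SET_LINES) for _ in range(10_000)]

rates = {size: lru_hit_rate(trace, size) for size in (32, 64, 128)}
for size, rate in rates.items():
    print(f"{size:>3} lines: hit rate {rate:.3f}")
```

Once the cache covers the working set (64 lines here), doubling it again to 128 lines changes nothing, because the only remaining misses are the first touch of each line.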

TL;DR - you can pack more SRAM bits / mm^2 into L3$ than into L1$ because it is slower and has less periphery circuit overhead - but the performance benefit is up to the architecture and how that L3$ is used (in AMD's case, as a victim cache). A larger L1$ may not be as beneficial as you might assume from looking at just single-core performance when you are also talking about 8C coherent CCXs.
