
1.2 Trillion transistor chip? Yes - Cerebras

Daniel Payne

Moderator
Yeah, the largest wafer-scale chip on the planet award goes to Cerebras at 1.2 trillion transistors. Aimed at speeding up ML operations, this is outrageously interesting. In theory, as die size increases, yield should approach zero, so how did Cerebras and TSMC team up to create this mammoth chip and even get it to yield?


[Image: chip-comparison.jpg — chip size comparison]
 
Daniel, could they be using most of the chip and have figured out how to disable or work around defective segments, taking them out of the equation? Could they also have designed some redundancy into the chip, so that perfection isn't required?
Any thoughts or comments on this would be appreciated.
 
Arthur, we'll have to wait for Cerebras to disclose more details at the Hot Chips conference. It's quite an accomplishment to actually get yield on a 16nm Wafer Scale Integration (WSI) project. At Intel there was research into WSI going back to 1978, but the equations predicted 0.0% yield for chips that covered the entire wafer. Most regular structures like DRAM, SRAM, and Flash memory have redundancy built in, and since this new Cerebras architecture has an array of 400,000 cores, it likely uses redundancy to map out the non-yielding cores. The 18 GB of on-chip SRAM likely uses redundancy as well.
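The redundancy argument can be made quantitative with a quick back-of-the-envelope calculation. All numbers here are illustrative assumptions, not Cerebras disclosures: take a mature-node defect density of 0.1 defects/cm² and the reported 46,225 mm² die area, and the expected number of killer defects is only about 46, so sparing well under 0.1% of 400,000 cores would absorb them all.

```python
# Back-of-the-envelope check on core redundancy.
# Assumed numbers (illustrative, NOT from Cerebras):
defect_density_per_cm2 = 0.1     # plausible for a mature 16nm-class process
die_area_mm2 = 46_225            # Wafer Scale Engine area (from the thread)
num_cores = 400_000

# Expected killer defects across the whole die (Poisson mean, lambda = A * D0)
expected_defects = defect_density_per_cm2 * die_area_mm2 / 100  # mm^2 -> cm^2

# Even if every defect killed a different core, the spare fraction is tiny.
spare_fraction = expected_defects / num_cores

print(f"expected killer defects: {expected_defects:.1f}")
print(f"core spare fraction needed: {spare_fraction:.4%}")
```

With these assumptions the wafer as a whole is essentially guaranteed to have dozens of dead cores, which is exactly why a sea-of-identical-cores architecture is the natural fit for wafer scale.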
 
Daniel, any thoughts on who the key equipment suppliers for this chip are, ASML, AMAT, etc. and why? Any other thoughts on the fabrication process would be appreciated. Has TSMC achieved a significant process breakthrough over the competition?
 

The key supplier is TSMC and TSMC's key supplier is AMAT.

I remember hitting 1 billion transistors and thinking it was an amazing accomplishment; 1 trillion is just astounding. As I have said many times before, AI will continue to push leading-edge processes due to its speed and density requirements. This is just one of many examples to come, absolutely.
 
Cerebras used a TSMC 16nm process, but somehow they worked to fill an entire wafer beyond the 32mm x 26mm reticle limit; most likely they did multiple exposures at the maximum reticle size. I really haven't heard of other fabs doing full-wafer chips because yield approaches 0 as die size increases. Visit this site to predict die yield, http://www.isine.com/resources/die-yield-calculator , and enter Width=215mm, Height=215mm, Wafer Diameter=300mm; it predicts a yield of 0%.
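The calculator's 0% prediction follows directly from the classic Poisson die-yield model, Y = exp(-A * D0). A minimal sketch, with the defect density D0 assumed (the die areas are the WSE's and a GV100-sized die's; the exact model the calculator uses may differ):

```python
import math

def poisson_yield(area_mm2, defect_density_per_cm2):
    """Classic Poisson die-yield model: Y = exp(-A * D0)."""
    return math.exp(-area_mm2 * defect_density_per_cm2 / 100)

# Assumed defect density for a mature 16nm-class process (illustrative)
d0 = 0.1  # defects per cm^2

print(poisson_yield(100, d0))     # ~0.90 for a small 100 mm^2 die
print(poisson_yield(815, d0))     # ~0.44 for a GV100-sized die
print(poisson_yield(46225, d0))   # ~1e-20 for the full-wafer die
```

The exponential makes the point: a monolithic full-wafer die with no redundancy essentially never yields, so some repair or route-around scheme is mandatory.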

 

It's easy, really: there are 7x12 independent chiplets that need wire bonding to be connected, so you just use the ones that work and route around the ones that don't. Each chiplet may also have redundant blocks, but getting 72 working ones out of 84 should be no problem on a mature process.
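Taking the 7x12-chiplet speculation at face value, the "no problem" claim is easy to check with a binomial tail. The per-chiplet yield values below are assumptions, not disclosed numbers:

```python
from math import comb

def p_at_least(k, n, p):
    """Probability that at least k of n independent chiplets yield,
    given per-chiplet yield p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed per-chiplet yields on a mature process (illustrative)
for p in (0.90, 0.95, 0.99):
    print(f"p={p}: P(at least 72 of 84 good) = {p_at_least(72, 84, p):.4f}")
```

Even at a modest 90% per-chiplet yield, the chance of getting 72 good ones out of 84 is high, which supports the commenter's intuition — assuming independence between chiplet failures.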
 
Yes, chiplets are one approach; however, that is never mentioned in any of the public Cerebras information. Do you have some private insight?
 
TSMC has definitely shown they are into innovation and thinking outside the box, and I look forward to more surprising developments in the near future. Any thoughts on where TSMC might innovate would be appreciated. The competition (really only Samsung is left) will be watching.
 
OK, just read the article at https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning, where it shows the die mounted to a substrate; now that makes sense.

I don't interpret the slides as meaning that they mount separate dies on a substrate. I interpret them as saying they are really doing a wafer-scale chip and needed interconnect over the scribe lines to achieve it. I don't read the slides as using wire bonding or an interposer for the die-to-die interconnect, as that would use too much power. The connections to the PCB seem to be only for the signals that have to go off-chip. To me it seems the configuration to route around faulty cores is done digitally, not physically.
I know so-called stitching processes have been used in the past for image sensors bigger than the reticle size; it seems that has now been applied to a digital process.
 
Here is another view of the chip from IEEE Spectrum:

6 Things to Know About the Biggest Chip Ever Built

1 | The stats

As the largest chip ever built, Cerebras’s Wafer Scale Engine (WSE) naturally comes with a bunch of superlatives. Here they are with a bit of context where possible:

Size: 46,225 square millimeters. That’s about 75 percent of a sheet of letter-size paper, but 56 times as large as the biggest GPU.

Transistors: 1.2 trillion. Nvidia’s GV100 Volta packs in 21 billion.

Processor cores: 400,000. Not to pick on the GV100 too much, but it has over 5000 CUDA cores and more than 600 tensor cores, both of which are used in AI workloads. (Fifty-six GV100s would then have more than 300,000 cores.)

Memory: 18 gigabytes of on-chip SRAM. Cerebras says this is 3,000 times as much as the GPU. But this is probably not a fair comparison as each GV100 works with 32 GB of high-bandwidth DRAM.

Memory bandwidth: 9 petabytes per second. According to Cerebras, that's 10,000 times our favorite GPU, but it's hard to see what the startup is measuring to get that number. As one reader pointed out, this analysis gave one part of the GV100's SRAM (the L2 cache) a throughput of 2155 GB/s, easily cutting Cerebras's supposed lead in half.
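The superlatives above can be sanity-checked with a few lines of arithmetic. The die areas and the 2155 GB/s L2 figure come from the article; the GV100 die area and US letter dimensions are standard published values:

```python
# Sanity-checking the article's size and bandwidth comparisons.
wse_area_mm2 = 46_225            # Cerebras WSE area (from the article)
gv100_area_mm2 = 815             # Nvidia GV100 die area (published figure)
letter_mm2 = 215.9 * 279.4       # US letter sheet, 8.5 x 11 inches in mm

print(wse_area_mm2 / gv100_area_mm2)   # ~56.7x -> "56 times as large"
print(wse_area_mm2 / letter_mm2)       # ~0.77 -> "about 75 percent" of a sheet

wse_bw = 9e15                    # 9 PB/s (Cerebras claim)
gv100_l2_bw = 2155e9             # GV100 L2 throughput from the cited analysis
print(wse_bw / gv100_l2_bw)      # ~4200x: roughly half the claimed 10,000x
```

The numbers check out: the area claims are consistent, and measuring against the GV100's L2 cache does indeed shrink the bandwidth advantage from 10,000x to roughly 4,200x.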

2 | Why do you need this monster?

 
I am not sure this fits the traditional definition of monolithic wafer scale integration (WSI). However, if one uses wafer-scale interconnect to provide inter-die connection, then this can loosely be called wafer scale integration. The following proposal could work:
1. The reticle dies are tested and repaired, just like the NVDA big chips.
2. On a separate wafer, fabricate the interconnect signal carrier: the bridging wires between dies, power vias to top and bottom, pads, and possibly clock distribution.
3. Mount the reticle dies on the wafer-scale interconnect using flip-chip technology.
4. Hold the dies in position and provide mechanical support.
5. Etch off the silicon (interconnect wafer carrier).

This allows the use of step-and-repeat for the reticle dies, and a full-wafer projection aligner for the wafer-scale interconnect, held in place on a rigid wafer until die flip-chip attach. After the silicon is etched off, the dies are free to move relative to each other on a stretchable membrane interfacing to the PCB.
How complex the signal carrier is depends very much on the architecture and clocking scheme. Done this way, it is a form of MCM. The signal carrier provides the I/O density matching and the power connections between the dies and the PCB.
Things get a lot more complicated if the part consumes high power, say a few kW, and has a lot of I/O.
 

What I understood is that the interconnect was not done on a separate wafer, for power reasons; the normal metal layers of the process were used, with an arrangement with TSMC to allow interconnect over the scribe-line structures.
 
That was what I thought at the beginning when I saw WSI, but the assembly drawing shows the reticle dies arranged to manage the TCE mismatch between silicon and the PCB. There are multiple problems that need to be solved at the same time, and the presentation does not contain enough information to show how they were solved.
Connecting dies with interconnect bridges is the easy part, but other things, like clock distribution circuitry, are needed to make this work. In WSI, if the interconnect is too lossy, signal propagation speed will be limited by the interconnect RC time constant.
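The RC concern can be made concrete. With assumed (purely illustrative) per-unit wire parasitics for an upper metal layer, the distributed-RC delay grows with the square of wire length, which is why unrepeatered cross-wafer nets are impractical and die-to-die hops must stay short:

```python
# Rough distributed-RC (Elmore) delay for an unrepeatered on-wafer wire.
# Per-unit parasitics are assumed, illustrative values for an upper metal layer:
r_per_um = 0.5        # ohms per micron of wire
c_per_um = 0.2e-15    # farads per micron of wire (0.2 fF/um)

def rc_delay_s(length_um):
    """Elmore delay of a distributed RC line: t = 0.5 * r * c * L^2."""
    return 0.5 * r_per_um * c_per_um * length_um**2

# Delay grows quadratically with length, so a wire spanning the wafer
# (hundreds of mm) is hopeless without buffering or registering at each die.
for mm in (1, 10, 100):
    print(f"{mm:>4} mm wire: {rc_delay_s(mm * 1000) * 1e9:.2f} ns")
```

Under these assumptions a 10x longer wire is 100x slower, so the practical wafer-scale design keeps signals local and re-registers them at every chiplet boundary.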
 