
Cisco launched its Silicon One G300 AI networking chip in a move that aims to compete with Nvidia and Broadcom.

Can you point to one of these analyses?

I think you've already decided the answer you will believe, so it looks like asking this question here is not going to be productive. Do you believe GPUs, even with tensor cores and specialized interconnects like NVLink, are an optimal answer for AI training?
To first order these problems seem to be solved; however, solving them increases the cost of the system significantly, since special materials, techniques, and tools had to be developed to address them.
The iso-space performance/watt numbers of the CS-3-based systems are better than those of B200-based systems. However, the iso-space performance/watt/$ is much worse than for B200-based systems, and it is evident from the discussion above that the higher cost of solving the problems associated with wafer-scale chips contributes to this.
This compares the WSE-3 with the B200, which is of course no longer state-of-the-art:

This means models up to ~40 billion parameters (with 16-bit weights) can fit entirely on-chip and can be run without ever touching off-chip memory. This is a huge advantage: each core gets single-cycle access to weights and activations.
When models exceed 44 GB, Cerebras uses a technique called Weight Streaming: the model parameters reside in external MemoryX cabinets (which can hold many TB), and the wafer streams in the needed weights layer by layer.
The catch is that when streaming from off-chip, performance depends on that external memory bandwidth and the sparsity/pattern of weight access.

There's no doubt that the Cerebras solution has fantastic performance so long as the models fit in the WSE memory, but it takes a hit when they don't -- and the problem I see here is the almost exponentially increasing size of AI models in cases where low-latency memory access is needed. With other solutions it's a lot easier to get a lot more HBM memory close to the NPUs, and this is then considerably faster than the Cerebras external memory for models which fit in HBM. Once they're too big even for this, the playing field levels out again; in fact Cerebras may have an advantage by having fewer levels of memory hierarchy.

But the other issue with Cerebras is cost -- so not performance but performance per dollar. Here the specialized custom hardware (and much lower volumes) put it at a considerable disadvantage.

If neither of these is correct, why has Cerebras not taken over the entire AI world and wiped out the competition?

Your last question -- nope, absolutely not -- they do a good job, but for sure something more exactly tailored to the task could do it better. But then it would also have less flexibility, the classic custom ASIC problem... ;-)
 
Do you believe GPUs, even with tensor cores and specialized interconnects like NVLink, are an optimal answer for AI training?
My personal opinion is that as models evolve and improve, elements of both training and inference are becoming increasingly specialized, requiring tons of flexibility in interconnect, memory configuration/hierarchy and malleability of heavy compute resources. With the current generation of MoE disaggregated transformer models, there isn't just a single stage/process for training or inference - there are numerous stages requiring different mixes of compute/memory/io resources. For instance, training is really a 3-stage process of pretraining → Supervised fine‑tuning (SFT) → Reinforcement Learning with Human Feedback (RLHF) / preference‑based post‑training.

For a frontier MoE LLM, a typical split might look like this in terms of FLOPs (not counting data or human costs):
• Pretraining: ~80–95% of total FLOPs, even with MoE efficiency gains.
• SFT: a few percent (order 1–10% depending on scale and number of passes).
• RLHF / preference tuning (including reward model): also a few percent, sometimes similar to or somewhat larger than SFT, but still far below pretraining.
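To put rough numbers on that split, here is a quick back-of-the-envelope sketch using the common ~6 × active-parameters × tokens approximation; every figure in it is an illustrative assumption of mine, not a number from any published training run.

```python
# Rough, illustrative estimate of how training FLOPs might split across stages
# for a hypothetical frontier MoE model. All numbers are assumptions, not
# figures from any real training run.

ACTIVE_PARAMS = 40e9      # assumed active parameters per token (MoE)
PRETRAIN_TOKENS = 15e12   # assumed pretraining token count
SFT_TOKENS = 0.3e12       # assumed supervised fine-tuning tokens
RLHF_TOKENS = 0.5e12      # assumed RLHF / preference-tuning tokens (incl. reward model passes)

def train_flops(active_params: float, tokens: float) -> float:
    """Common ~6*N*D rule of thumb for training FLOPs (N = active params, D = tokens)."""
    return 6.0 * active_params * tokens

stages = {
    "pretraining": train_flops(ACTIVE_PARAMS, PRETRAIN_TOKENS),
    "SFT": train_flops(ACTIVE_PARAMS, SFT_TOKENS),
    "RLHF/preference": train_flops(ACTIVE_PARAMS, RLHF_TOKENS),
}
total = sum(stages.values())

for name, flops in stages.items():
    print(f"{name:16s} {flops:.2e} FLOPs  ({100 * flops / total:.1f}% of total)")
```

With these assumed token counts, pretraining comes out around 95% of the total, with SFT and RLHF in the low single digits each, consistent with the ranges above.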

MoE substantially reduces the FLOPs needed for pretraining vs dense-model training for a given capability level, but it also introduces the need for sophisticated expert routing and load balancing. And if you're a Chinese model like DeepSeek, there might be a big "distilling someone else's model" step ;)

I do think that models are also somewhat evolving, both from a training and an inference perspective, to take best advantage of current and planned GPU hardware, except perhaps Gemini, which is better tuned for TPU systems. But that evolution means that any piece of hardware highly dedicated to today's models would likely be DOA by the time the silicon made it into working systems, 3-4 years later.
 
To first order these problems seem to be solved; however, solving them increases the cost of the system significantly, since special materials, techniques, and tools had to be developed to address them.
The iso-space performance/watt numbers of the CS-3-based systems are better than those of B200-based systems. However, the iso-space performance/watt/$ is much worse than for B200-based systems, and it is evident from the discussion above that the higher cost of solving the problems associated with wafer-scale chips contributes to this.
This compares the WSE-3 with the B200, which is of course no longer state-of-the-art:

This means models up to ~40 billion parameters (with 16-bit weights) can fit entirely on-chip and can be run without ever touching off-chip memory. This is a huge advantage: each core gets single-cycle access to weights and activations.
When models exceed 44 GB, Cerebras uses a technique called Weight Streaming: the model parameters reside in external MemoryX cabinets (which can hold many TB), and the wafer streams in the needed weights layer by layer.
The catch is that when streaming from off-chip, performance depends on that external memory bandwidth and the sparsity/pattern of weight access.

There's no doubt that the Cerebras solution has fantastic performance so long as the models fit in the WSE memory, but it takes a hit when they don't -- and the problem I see here is the almost exponentially increasing size of AI models in cases where low-latency memory access is needed. With other solutions it's a lot easier to get a lot more HBM memory close to the NPUs, and this is then considerably faster than the Cerebras external memory for models which fit in HBM. Once they're too big even for this, the playing field levels out again; in fact Cerebras may have an advantage by having fewer levels of memory hierarchy.

But the other issue with Cerebras is cost -- so not performance but performance per dollar. Here the specialized custom hardware (and much lower volumes) put it at a considerable disadvantage.

If neither of these is correct, why has Cerebras not taken over the entire AI world and wiped out the competition?

Your last question -- nope, absolutely not -- they do a good job, but for sure something more exactly tailored to the task could do it better. But then it would also have less flexibility, the classic custom ASIC problem... ;-)
Your two references are weak.

Neither one of them contains any deep architecture discussion, implementation details, or comparisons, and neither has any measurements. The first is written by a "Go To Market" company. ?? I do agree with the concerns about Cerebras packaging, cooling, and power requirements, which is why I have mentioned before that I think their immediate future lies in cloud-based access. There's no significant information in the article about MemoryX architecture or how it relates to training flows for very large models. I don't see how it supports your hypothesis at all.

The second paper, actually a Medium post, was better, but it still made me chuckle, perhaps more than the first. The author refers to the Google TPU as a "Broadcom ASIC" and groups it with the AWS Trainium? That's ridiculous. The post contains some nicely written high-level discussions of the processor architecture of the several products it covers, but the discussions stay at a high level and neither support nor refute the assertion you're making about Cerebras off-chip MemoryX access with analysis or data.

While this paper does not examine the performance of a Cerebras multi-node system, it is more of the calibre of paper I look for in product analysis:


Outside of a possible G42 system, I have serious doubts that there are any Cerebras MemoryX installations in customer sites. (I can't find any references to one.) Perhaps when one is installed and measured, we'll see some published results. Until then it is difficult to draw supported conclusions.
 
My personal opinion is that as models evolve and improve, elements of both training and inference are becoming increasingly specialized, requiring tons of flexibility in interconnect, memory configuration/hierarchy and malleability of heavy compute resources. With the current generation of MoE disaggregated transformer models, there isn't just a single stage/process for training or inference - there are numerous stages requiring different mixes of compute/memory/io resources. For instance, training is really a 3-stage process of pretraining → Supervised fine‑tuning (SFT) → Reinforcement Learning with Human Feedback (RLHF) / preference‑based post‑training.

For a frontier MoE LLM, a typical split might look like this in terms of FLOPs (not counting data or human costs):
• Pretraining: ~80–95% of total FLOPs, even with MoE efficiency gains.
• SFT: a few percent (order 1–10% depending on scale and number of passes).
• RLHF / preference tuning (including reward model): also a few percent, sometimes similar to or somewhat larger than SFT, but still far below pretraining.

MoE substantially reduces the FLOPs needed for pretraining vs dense-model training for a given capability level, but it also introduces the need for sophisticated expert routing and load balancing. And if you're a Chinese model like DeepSeek, there might be a big "distilling someone else's model" step ;)

I do think that models are also somewhat evolving, both from a training and an inference perspective, to take best advantage of current and planned GPU hardware, except perhaps Gemini, which is better tuned for TPU systems. But that evolution means that any piece of hardware highly dedicated to today's models would likely be DOA by the time the silicon made it into working systems, 3-4 years later.
Agree that AI HW needs will be a lot more diversified, depending on the different stages mentioned here, and perhaps even on multiple different foundational models (not all LLM-based). So in a way, whoever can master the process of putting out such customized silicon at a radically faster pace will win.
 
So in a way, whoever can master the process of putting out such customized silicon at a radically faster pace will win.
Two thoughts:
* Agree that we will see new specialized foundational frontier models that have very different architectures from the current MoE transformers. But MoE transformers are proving to be far more generalizable than their original text-based forerunners. I'm seeing new multimodal variants (image, video) and co-optimization in the pipeline. My view is that GNNs will eventually enter the fray.
* I'm betting that software-controlled flexible hardware built from small-granularity, heavily interconnected accelerators will evolve faster for co-optimization than dedicated hardware.
 
Just out of curiosity, because of the controversy over Cerebras CS-3 system clustering scalability, I began to poke around with some internet searches. One fascinating link that showed up was the programming-guide portion of the Cerebras website dedicated to CS-3 components and how to build training applications specifically for the CS-3. Their documentation is surprisingly frank, especially for publicly available information, and deeply technical -- too technical for me in places, for example the specifics of PyTorch programming, which forced me to educate myself further. For those with insatiable curiosity, or trouble getting to sleep, I recommend this section of the Cerebras website:


Unfortunately, I haven't found any cluster system scaling measurements yet, but embedded in the text you'll see bravado about how near-linear the performance scaling is. Hmmm. I continue to be skeptical about how many clustered systems have been built, because the MemoryX and SwarmX nodes still appear to be based on networking technology that is multiple generations old (100 Gb/s Ethernet). Or, I suppose, it is possible that MemoryX operations are latency-sensitive but not throughput-constrained, because the weight data is not large.
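For a sense of scale on that last point, a quick back-of-the-envelope calculation (all of the model and link numbers below are my own illustrative assumptions) shows what streaming one layer's worth of weights over 100 GbE would cost:

```python
# Back-of-the-envelope: time to stream one transformer layer's weights over a
# 100 Gb/s link. All model and link numbers here are illustrative assumptions.

PARAMS_TOTAL = 70e9        # assumed total parameters
BYTES_PER_WEIGHT = 2       # 16-bit weights
N_LAYERS = 80              # assumed layer count
LINK_GBPS = 100            # 100 Gb Ethernet
LINK_BYTES_PER_S = LINK_GBPS * 1e9 / 8

layer_bytes = PARAMS_TOTAL * BYTES_PER_WEIGHT / N_LAYERS
t_stream = layer_bytes / LINK_BYTES_PER_S

print(f"weights per layer : {layer_bytes / 1e9:.2f} GB")
print(f"time over 100 GbE : {t_stream * 1e3:.0f} ms per layer")
# If the on-wafer compute per layer takes longer than this, or streaming the
# next layer overlaps with compute on the current one, the link is not the
# bottleneck; otherwise it is.
```

Under these assumptions it works out to roughly 1.75 GB and ~140 ms per layer, which is why the question of whether weight streaming can be overlapped with compute matters so much.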
 
I don't think there's any doubt about Cerebras scalability, or the excellent performance when the problem fits "on-wafer", since there's a massive pool of NPUs and SRAM with ridiculous interconnect bandwidth and low latency.

The question is what happens when it *won't* fit on-wafer, specifically because for a given amount of NPU processing power there's a *lot* less closely coupled memory than there is with the HBM stacks in conventional architectures -- I haven't crunched the numbers, but given SRAM-vs-HBM density/die area and HBM stack height I suspect the HBM is bigger by at least 10x, which means at least a 10x bigger model will fit. The HBM latency and bandwidth aren't as good as the Cerebras SRAM's -- and in either case, once you get past the capacity of SRAM/HBM you have to go to off-board memory, which is much slower and higher latency, the same for either system. You can try to hide the latency/bandwidth, but this is only going to work sometimes.

That suggests to me that up to a 1x model size (one which will fit into Cerebras on-wafer SRAM) Cerebras will have a considerable performance advantage, which is what the benchmarks show.

From this model size up to maybe 10x (or more? -- see below) a conventional architecture using HBM should be considerably faster than Cerebras which has to go off-board to fetch data.

Above 10x size both architectures should be similar, assuming similar mass memory storage and comms links to it -- what one can do, so can the other.
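To put the three regions into a form that's easy to play with, here's a minimal sketch: the 44 GB SRAM figure is the one discussed in this thread, while the ~10x HBM pool and the regime labels just encode my argument above, not any measured data.

```python
# Minimal sketch of the three-region argument above. The 44 GB SRAM figure is
# from the thread; the HBM pool size and regime labels are assumptions that
# just encode the reasoning, not measurements.

WSE_SRAM_GB = 44            # on-wafer SRAM (Cerebras WSE-3)
HBM_POOL_GB = 44 * 10       # assumed "equivalent compute" HBM pool (~10x, per the argument)

def likely_winner(model_gb: float) -> str:
    """Classify a model footprint (weights in GB) into the three regions."""
    if model_gb <= WSE_SRAM_GB:
        return "region 1: fits in WSE SRAM -> Cerebras advantage"
    if model_gb <= HBM_POOL_GB:
        return "region 2: fits in HBM but not SRAM -> HBM-based systems advantage"
    return "region 3: exceeds both -> roughly a wash (both go off-board)"

for gb in (20, 44, 100, 400, 1000):
    print(f"{gb:5d} GB model: {likely_winner(gb)}")
```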

So the question is -- where do the AI model sizes sit today compared to these 3 regions, and where will they go in future? It seems that the size is rapidly increasing, which you'd think means more and more will not sit in the Cerebras sweet spot any more but will move into the region where HBM wins. And it's *much* easier to add more HBM into a system (if you can get it, of course...) than it is to expand the Cerebras SRAM (which has stopped scaling!), so if the gap in local memory is 10x now (is it? or is it 20x with newer HBM?) it wouldn't be difficult to double this, or even more in future since HBM is still scaling but SRAM isn't.

==> Is there any genuine data out there which answers this question?

On top of that, even if Cerebras wins in performance, they have a distinct disadvantage in performance per dollar given the exotic system construction -- so maybe they win in cases where absolute performance matters more than cost (government-funded labs?), but only get a small fraction of the overall AI market, which -- let's face it -- is going to be heavily cost-driven (including power consumption).

And once the model is too big to even fit in local HBM any more, performance should be a wash but Cerebras will be at a considerable cost disadvantage -- which means, dead in the water.

Am I missing something?
 
I don't think there's any doubt about Cerebras scalability, or the excellent performance when the problem fits "on-wafer", since there's a massive pool of NPUs and SRAM with ridiculous interconnect bandwidth and low latency.
I think you raise a bunch of valid questions about the limits of performance vs model size and cost. Plus your earlier question was great as well - how big is the market for super-fast response time (low latency with high token rates) at high TCO? I think the real question is whether Cerebras can adapt their architecture to support the system-level model/system optimizations happening in data-center-scale work today. It's a positive that they are operating their own data centers, so there is cost/performance pressure to deliver directly (no middlemen). But we don't have SemiAnalysis-like benchmarks to give us greater insight yet.
 
I don't think there's any doubt about Cerebras scalability, or the excellent performance when the problem fits "on-wafer", since there's a massive pool of NPUs and SRAM with ridiculous interconnect bandwidth and low latency.

The question is what happens when it *won't* fit on-wafer, specifically because for a given amount of NPU processing power there's a *lot* less closely coupled memory than there is with the HBM stacks in conventional architectures -- I haven't crunched the numbers, but given SRAM-vs-HBM density/die area and HBM stack height I suspect the HBM is bigger by at least 10x, which means at least a 10x bigger model will fit. The HBM latency and bandwidth aren't as good as the Cerebras SRAM's -- and in either case, once you get past the capacity of SRAM/HBM you have to go to off-board memory, which is much slower and higher latency, the same for either system. You can try to hide the latency/bandwidth, but this is only going to work sometimes.
To make my position completely clear, since you seem confused about it: the only scalability I'm referring to in my responses is multi-WSE-3 configurations linked through SwarmX and MemoryX. Scalability within a single WSE-3 is a given in my mind.

Comparing Cerebras's memory hierarchy to Nvidia's and AMD's HBM usage seems very complex. For example, as you mentioned, the WSE-3 SRAM size is 44GB. Each Blackwell GPU has 192GB or 288GB of HBM, depending on whether it is the Ultra version (two-die) or not. And that's just for one Blackwell. HBM, as you probably know, is high bandwidth, but has greater latency to the first byte of an access than DDR5, while the WSE-3 SRAM has a claimed latency below 1ns, so probably 1/100th the latency of HBM. The other difference is that Blackwell keeps several different types of data in HBM, including KV caches, intermediate results, and collectives. Cerebras insists SwarmX and MemoryX are only used for sharing weights between WSE-3s, which not only confuses me, but tells me the processing model for the WSE-3 is completely different from what Nvidia does in its GPUs. I haven't found sufficient architecture details on WSE-3 processing flows in a form I can comprehend, though there are detailed processing-flow descriptions in their 1.4.0 SDK, which I haven't studied yet.


That suggests to me that up to a 1x model size (one which will fit into Cerebras on-wafer SRAM) Cerebras will have a considerable performance advantage, which is what the benchmarks show.

From this model size up to maybe 10x (or more? -- see below) a conventional architecture using HBM should be considerably faster than Cerebras which has to go off-board to fetch data.

Above 10x size both architectures should be similar, assuming similar mass memory storage and comms links to it -- what one can do, so can the other.

So the question is -- where do the AI model sizes sit today compared to these 3 regions, and where will they go in future? It seems that the size is rapidly increasing, which you'd think means more and more will not sit in the Cerebras sweet spot any more but will move into the region where HBM wins. And it's *much* easier to add more HBM into a system (if you can get it, of course...) than it is to expand the Cerebras SRAM (which has stopped scaling!), so if the gap in local memory is 10x now (is it? or is it 20x with newer HBM?) it wouldn't be difficult to double this, or even more in future since HBM is still scaling but SRAM isn't.
Given the fundamental differences in the use models for the memory hierarchies of the WSE-3 and Blackwell, and the highly contrasted technology differences, I can't answer this question yet. There are only two possibilities: you have a fundamental misunderstanding about how the WSE-3 works in a multi-node configuration, or Cerebras is lying about inter-WSE-3 scaling efficiency. I don't see a middle ground.
==> Is there any genuine data out there which answers this question?
Not that I've found yet. I'm thinking about just asking Cerebras directly.
On top of that, even if Cerebras wins in performance, they have a distinct disadvantage in performance per dollar given the exotic system construction -- so maybe they win in cases where absolute performance matters more than cost (government-funded labs?), but only get a small fraction of the overall AI market, which -- let's face it -- is going to be heavily cost-driven (including power consumption).

And once the model is too big to even fit in local HBM any more, performance should be a wash but Cerebras will be at a considerable cost disadvantage -- which means, dead in the water.

Am I missing something?
I think Cerebras does have a disadvantage in requiring exotic system construction, housing, and maintenance. (I like the term "exotic".) We also don't know what their systems really cost. Like, how much are Sandia Labs and G42 really paying for their WSE-3 hardware? Since Cerebras is private, we also don't know how much money they're losing.
 
OK @blueone and @IanD,
I cannot claim this as my own work, though I did carefully target the questions and pored over the result to make sure it made sense, even though it involved some extrapolation and inference from what's actually documented.

Cerebras handles MoE and KV-heavy transformers mainly by leaning on its weight‑streaming architecture (MemoryX + SwarmX + WSE) rather than the GPU‑style prefill/decode disaggregated serving stacks you see with vLLM/Dynamo/Neuron.[1][2][3]

## MoE on Cerebras

- Cerebras treats MoE as “just another” transformer variant: experts live in the FFN slots, and the compiler maps expert FFNs across the wafer’s cores while the router and gating are implemented in the same graph as dense models.[4][5][6]
- Because parameters are streamed from MemoryX rather than stored on device, you can scale total MoE parameter count (experts × depth) without being bound by on‑wafer SRAM, similar to how they handle multi‑trillion‑parameter dense models.[2][3][1]
- SwarmX + MemoryX keep the system in strict data parallel; a batch of tokens is sharded across CS‑3s, and the MoE routing decisions are local to each shard, so you don’t need custom “expert parallel” routing fabrics as on GPUs.[3][4]

### What this means in practice

- Expert sparsity (top‑k experts per token) reduces *compute* per token, but Cerebras still sees weights layer‑by‑layer via streaming; the main benefit is that fewer expert FFNs are instantiated per token on‑wafer at a time.[3][4]
- The routing network and load‑balancing losses are all handled in the Cerebras compiler graph; debugging tools they ship for “dead experts”/load skew are built into their training/inference workflow rather than a separate serving layer.[7][4]
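As a generic illustration of the top-k routing and load-balancing machinery referred to above (this is not Cerebras code; the shapes and the auxiliary-loss form are common textbook choices I've assumed):

```python
import numpy as np

# Generic top-k MoE gating with a simple load-balancing penalty -- an
# illustration of the routing/load-skew machinery discussed above, not
# Cerebras's implementation.

rng = np.random.default_rng(0)
N_TOKENS, D_MODEL, N_EXPERTS, TOP_K = 8, 16, 4, 2

tokens = rng.standard_normal((N_TOKENS, D_MODEL))
w_gate = rng.standard_normal((D_MODEL, N_EXPERTS))

logits = tokens @ w_gate
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# Pick the top-k experts per token; only those expert FFNs do work for it.
topk_idx = np.argsort(-probs, axis=-1)[:, :TOP_K]

# Load-balancing statistic: fraction of routed slots going to each expert.
load = np.zeros(N_EXPERTS)
for row in topk_idx:
    load[row] += 1
load /= (N_TOKENS * TOP_K)

# Auxiliary loss that is minimized when routing is uniform (a common form).
balance_loss = N_EXPERTS * float(np.sum(load * probs.mean(axis=0)))

print("per-expert load :", np.round(load, 2))
print("balance loss    :", round(balance_loss, 3))
```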

## Disaggregation vs GPUs

- Cerebras already **disaggregates parameters from compute**: MemoryX is effectively a big parameter server, WSE is the compute plane, and SwarmX is the broadcast/reduce fabric, so model storage is physically separate from compute nodes.[8][2][3]
- However, they do *not* publicly describe GPU‑style **prefill/decode disaggregation** where prefill and decode are run on different workers and KV cache is shipped over the network, as in Neuron “disaggregated inference” or Dynamo‑like designs.[9][10][11]
- Instead, the prefill and decode phases of autoregressive generation both execute on the WSE that owns the active sequence, with the same weight‑streaming machinery; the system’s disaggregation boundary is “parameters vs compute,” not “prefill vs decode.”[2][3]
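A toy sketch of that "parameters vs compute" boundary: a parameter store stands in for the MemoryX role, and the device only ever holds one layer's weights at a time while activations stay resident. Purely illustrative; these are not Cerebras APIs.

```python
import numpy as np

# Toy illustration of weight streaming: a parameter store (standing in for the
# MemoryX role described above) holds all layer weights; the "device" fetches
# one layer at a time, applies it to the resident activations, and discards it.

rng = np.random.default_rng(1)
D, N_LAYERS, BATCH = 32, 6, 4

# Parameter store: all layer weights live off-device.
param_store = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]

def fetch_layer_weights(layer_idx: int) -> np.ndarray:
    """Stand-in for streaming one layer's weights from the parameter store."""
    return param_store[layer_idx]

# Activations stay resident on the "device" for the whole forward pass.
activations = rng.standard_normal((BATCH, D))
for layer in range(N_LAYERS):
    w = fetch_layer_weights(layer)          # stream weights in
    activations = np.tanh(activations @ w)  # compute with resident activations
    del w                                   # weights are not kept on-device

print("output shape:", activations.shape)
```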

### Comparing disaggregation styles

| Aspect | Cerebras WSE + MemoryX/SwarmX | GPU disaggregated serving (Neuron, etc.) |
|--------------------------------|-------------------------------------------|-----------------------------------------------------|
| Main disaggregation boundary | Parameters vs compute | Prefill vs decode workers |
| Where KV cache lives | On WSE SRAM per sequence | On decode workers’ GPU memory |
| Cross‑node traffic focus | Streaming weights, gradient reduce | Shipping KV cache between prefill/decode workers |
| Parallelism model | Strict data parallel over replicas | Data + tensor + expert + PD disaggregation |
| MoE scaling focus | Streaming huge expert sets from MemoryX | Routing tokens across expert GPUs + KV movement |
[11][4][8][2][3]

## KV cache handling

- KV cache in a Cerebras transformer still lives in the accelerator’s on‑wafer SRAM during inference; there’s no public description of offloading KV to MemoryX or doing cross‑node KV shipping the way PD disaggregation frameworks do.[11][2][3]
- Instead, the architecture reduces KV pressure by attacking *attention itself*: they show work on sparse attention that halves KV memory by using mostly sparse patterns and only a minority of dense layers in a modified Llama‑style decoder.[12][13]
- Since compute is extremely abundant on the wafer and the on‑wafer memory bandwidth is enormous, Cerebras can also afford schemes where recomputation or sparse attention patterns trade a bit of extra math for a much lower KV footprint.[13][2]
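For scale, the standard KV-cache sizing arithmetic is sketched below; the model shape is an assumed Llama-like configuration, and the 0.5 factor stands in for the roughly-half compression the sparse-attention post claims.

```python
# Standard KV-cache sizing arithmetic (two tensors: K and V) with an assumed
# Llama-like shape. The 0.5 factor models the roughly-half compression from
# sparse attention described in the cited Cerebras post.

N_LAYERS = 32
N_KV_HEADS = 8          # grouped-query attention
HEAD_DIM = 128
BYTES = 2               # 16-bit KV entries
SEQ_LEN = 128_000
BATCH = 1

kv_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * SEQ_LEN * BATCH
print(f"dense KV cache : {kv_bytes / 1e9:.1f} GB per sequence")
print(f"~sparse (x0.5) : {0.5 * kv_bytes / 1e9:.1f} GB per sequence")
```

With these assumed numbers a single 128k-token sequence already consumes tens of GB of KV, which is why compressing KV matters so much on a 44 GB SRAM budget.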

### Disaggregated transformer implications

- For “disaggregated transformer” ideas (huge off‑chip weights, on‑chip activations/KV) Cerebras is already there: weight streaming makes the transformer effectively disaggregated at the parameter level, but **KV and activations are intentionally kept local** to avoid network latency in the inner loop.[1][2][3]
- If you imagine a DeepSeek‑style stack on Cerebras, the analogue would likely be: experts and dense layers all live in MemoryX, WSE runs prefill+decode on‑wafer with sparse attention to compress KV, and scale‑out is purely via data parallel replicas rather than explicit PD disaggregation and KV shipping.[14][13][3]

Sources
[1] tell me about weight streaming on cerebras https://www.perplexity.ai/search/6d16c517-a056-4ba7-a335-56db3047a1dd
[2] Weight Streaming Execution - Cerebras AI https://training-docs.cerebras.ai/rel-2.5.0/concepts/weight-streaming-execution
[3] Linear Scaling Made Possible with Weight Streaming - Cerebras https://www.cerebras.ai/blog/linear-scaling-made-possible-with-weight-streaming
[4] MoE at Scale: Making Sparse Models Fast on Real Hardware https://www.cerebras.ai/blog/moe-guide-scale
[5] MoE Fundamentals: Why Sparse Models Are the Future of AI https://www.cerebras.ai/blog/moe-guide-why-moe
[6] [PDF] MoEfication: Transformer Feed-forward Layers are Mixtures of Experts https://aclanthology.org/2022.findings-acl.71.pdf
[7] Debugging Dead MoE Models: A Step-by-Step Guide - Cerebras https://www.cerebras.ai/blog/moe-guide-debug
[8] what does scale-up ans scale-up look like for cerebras ? https://www.perplexity.ai/search/10cd6ef1-7c82-45c8-bcec-1bf89dce3758
[9] Tell me about NVIDIA dynamo and multi-headed attention with diaggregation https://www.perplexity.ai/search/9ea90ed3-2196-4d89-aff0-c9819d1b8937
[10] Disaggregated inference https://docs.modular.com/mammoth/disaggregated-inference/
[11] Disaggregated Inference [BETA] — AWS Neuron Documentation https://awsdocs-neuron.readthedocs-...developer_guides/disaggregated-inference.html
[12] Mixture-of-Experts (MoE) LLMs - by Cameron R. Wolfe, Ph.D. https://cameronrwolfe.substack.com/p/moe-llms
[13] Compressing KV cache memory by half with sparse attention https://www.cerebras.ai/blog/compressing-kv-cache-memory-by-half-with-sparse-attention
[14] plese provided more details about disaggregation, KV cache, multi-token predication and communication optimizations https://www.perplexity.ai/search/e7adce4c-eccc-4b5a-9b4f-b244ed5fcf23
[15] MoE-Inference-Bench: Performance Evaluation of Mixture of Expert ... https://arxiv.org/html/2508.17467v1
[16] Prefill-decode disaggregation | LLM Inference Handbook https://bentoml.com/llm/inference-optimization/prefill-decode-disaggregation
[17] How to Build, Train & Debug MoE Models in 2025 - YouTube
[18] Disaggregated Prefill and Decoding Inference System for Large ... https://arxiv.org/abs/2509.17542
[19] Learn Mixture of Experts (MoE) with our new series - LinkedIn https://www.linkedin.com/posts/cerebras-systems_moe-activity-7353453824520404993-lWWg
 
@IanD and @blueone, the second bit.

Cerebras buys simplicity and bandwidth by *not* having a big, coherent, multi‑processor KV/memory hierarchy—but it also gives up several capabilities that GPU‑style, cache‑centric clusters are starting to exploit.[1][2][3]

### 1. No cross‑device KV sharing

Because KV cache is local to a single WSE and there is no documented, coherent KV fabric across CS‑3s, Cerebras cannot:
- Share a long‑context KV across many concurrent decoders the way disaggregated inference systems do (e.g., one prefill feeding many decode workers).[2][4][5]
- Reuse KV across requests *on different devices* for prefix caching or multi‑tenant fan‑out; reuse is essentially constrained to what fits on one wafer.[1][2]

On GPUs with a KV‑aware serving layer, you can run prefill once, then route many follow‑ups to different decode workers that all see a shared KV pool; Cerebras’ data‑parallel pods don’t expose that kind of KV‑coherence abstraction.[4][5][6]
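A toy prefix cache makes the "run prefill once, reuse the KV for follow-ups" idea concrete; the class and its methods are invented for illustration and are not the API of vLLM, Dynamo, or Neuron.

```python
import hashlib

# Toy prefix cache illustrating "run prefill once, reuse KV for follow-ups".
# The cache key is a hash of the shared token prefix. Invented for
# illustration; not the API of any real serving stack.

class PrefixKVCache:
    def __init__(self):
        self._store = {}   # prefix hash -> opaque KV blob

    @staticmethod
    def _key(prefix_tokens):
        return hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()

    def get(self, prefix_tokens):
        return self._store.get(self._key(prefix_tokens))

    def put(self, prefix_tokens, kv_blob):
        self._store[self._key(prefix_tokens)] = kv_blob

def serve(prefix_tokens, new_tokens, cache: PrefixKVCache) -> str:
    kv = cache.get(prefix_tokens)
    if kv is None:
        kv = f"KV({len(prefix_tokens)} tokens)"   # stand-in for running prefill
        cache.put(prefix_tokens, kv)
        return "prefill prefix + decode"
    return "prefix cache hit: decode only"

cache = PrefixKVCache()
shared_doc = list(range(1000))                    # a long shared context
print(serve(shared_doc, [7], cache))              # first request pays prefill
print(serve(shared_doc, [9], cache))              # follow-up reuses the prefix KV
```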

### 2. Limited global memory hierarchy tricks

Cerebras has a sharp split: on‑wafer SRAM for activations/KV, and off‑wafer MemoryX for weights; there isn’t a multi‑level, shared, device‑coherent cache hierarchy (HBM tiers, host DRAM tiers, NVMe tiers) tuned around KV the way large GPU systems are evolving.[3][7][8]

That means it misses out on:
- Fine‑grained KV offload/bring‑back policies across devices and tiers (HBM ↔ host ↔ NVMe) that let GPU stacks push sequence lengths or batch sizes beyond single‑device KV capacity.[6][9][2]
- Cross‑model or cross‑session KV/page caching: a GPU cluster can, in principle, treat KV like pages in a distributed cache and keep hot prefixes in a multi‑node hierarchy; WSE‑3 does not present that abstraction—it just has very fast local SRAM.[10][2][1]
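A sketch of the kind of tiered offload policy meant here: KV pages are demoted from a small fast tier (think HBM) to a larger slow tier (think host DRAM or NVMe) when the fast tier fills. The tier sizes and the LRU policy are arbitrary choices for illustration, not any particular framework's behavior.

```python
from collections import OrderedDict

# Toy two-tier KV page manager: a small fast tier backed by a large slow tier.
# Pages are demoted LRU-style when the fast tier fills. Arbitrary policy,
# for illustration only.

class TieredKV:
    def __init__(self, fast_capacity_pages: int):
        self.fast = OrderedDict()     # page_id -> data, ordered by recency
        self.slow = {}                # page_id -> data
        self.cap = fast_capacity_pages

    def touch(self, page_id, data=None):
        """Access (or insert) a KV page, promoting it to the fast tier."""
        if page_id in self.fast:
            self.fast.move_to_end(page_id)
        else:
            data = data if data is not None else self.slow.pop(page_id)
            self.fast[page_id] = data
            if len(self.fast) > self.cap:               # demote least recent
                victim, vdata = self.fast.popitem(last=False)
                self.slow[victim] = vdata

tiers = TieredKV(fast_capacity_pages=2)
for seq in ("A", "B", "C"):            # inserting C demotes A (least recently used)
    tiers.touch(seq, data=f"kv-{seq}")
tiers.touch("A")                        # re-access A: promoted back, B demoted
print("fast tier:", list(tiers.fast), "| slow tier:", list(tiers.slow))
```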

### 3. Less flexibility for exotic KV‑heavy workloads

The wafer is fantastic when most of the action is “do a lot of math on data that fits in 44 GB SRAM,” but it is less ideal when the *primary* challenge is orchestrating huge KV graphs across many agents or tools. For example:[11][12][1]
- Multi‑agent, tool‑using workloads where many agents share a large, evolving world‑state KV over long horizons benefit from a global KV/memory fabric; Cerebras mostly assumes per‑wafer isolation with data‑parallel replication.[10][1]
- Workloads that want aggressive KV pooling across models (e.g., shared retrieval contexts, shared document windows for dozens of micro‑models) are easier to host on a cluster that treats KV as a first‑class distributed object.[2][6]

Cerebras partly compensates by attacking KV size directly (sparse attention, memory tokens, sensitivity‑based selection of dense vs sparse layers), cutting KV memory nearly in half for Llama‑style models. That’s very good for “one big sequence per wafer,” but not a substitute for a fabric that can move and share KV arbitrarily across many processors.[2]

### 4. Constraints on disaggregated prefill/decode

Because there’s no massive coherent KV hierarchy across CS‑3s, prefill and decode are not cleanly disaggregated across multiple wafers with KV shipping between them in the way Neuron/Dynamo‑style systems do.[5][3]

You lose:
- The ability to run prefill on a large, bursty pool and then hand off only KV to a small, latency‑tuned decode pool across the cluster.[5][6]
- Some of the elasticity and autoscaling patterns where KV is the unit of work moved around the fleet.

Instead, Cerebras relies on weight‑streaming disaggregation (parameters in MemoryX, compute+KV on wafer) and keeps KV local, which simplifies programming and gives terrific per‑device efficiency, but doesn’t expose the emergent “KV‑as‑a‑service” semantics that a massive coherent multi‑processor hierarchy could.[8][13][3]


Sources
[1] Cerebras Wafer-Scale Engine Overview - Emergent Mind https://www.emergentmind.com/topics/cerebras-wafer-scale-engine-wse
[2] Compressing KV cache memory by half with sparse attention https://www.cerebras.ai/blog/compressing-kv-cache-memory-by-half-with-sparse-attention
[3] A Comparison of the Cerebras Wafer-Scale Integration Technology ... https://arxiv.org/html/2503.11698v1
[4] Cerebras WSE-3: A New Frontier for AI Computation - LinkedIn https://www.linkedin.com/posts/yuga...ras-wse3-ai-activity-7359230594641268737-q9UV
[5] Disaggregated Inference [BETA] — AWS Neuron Documentation https://awsdocs-neuron.readthedocs-...developer_guides/disaggregated-inference.html
[6] Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on ... https://arxiv.org/html/2512.16473v1
[7] Cerebras Wafer-Scale Engine Overview - Emergent Mind https://www.emergentmind.com/topics/cerebras-wafer-scale-engine
[8] Weight Streaming Execution - Cerebras AI https://training-docs.cerebras.ai/rel-2.5.0/concepts/weight-streaming-execution
[9] Accelerating Mixture-of-Experts Inference by Hiding Offloading ... https://arxiv.org/html/2508.21706v1
[10] Right Systems for Agentic Workloads - Chipstrat https://www.chipstrat.com/p/right-systems-for-agentic-workloads
[11] With wafer scale chips becoming more popular, what's ... - Reddit
[12] Cerebras Wafer-Scale Engine: When to Choose Alternative AI ... https://introl.com/blog/cerebras-wafer-scale-engine-cs3-alternative-ai-architecture-guide-2025
[13] Cerebras' Wafer-Scale Architecture Solves AI Memory Bottleneck https://www.linkedin.com/posts/gaut...the-biggest-activity-7408383432768053248-hF5Z
[14] [PDF] WaferLLM: A Wafer-Scale LLM Inference System - arXiv https://arxiv.org/pdf/2502.04563.pdf
[15] 100x Defect Tolerance: How Cerebras Solved the Yield Problem https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem
[16] [PDF] Cerebras Wafer-Scale AI - Hot Chips 2024 - https://hc2024.hotchips.org/assets/program/conference/day2/72_HC2024.Cerebras.Sean.v03.final.pdf
[17] Cerebras CS-3 wafer-scale million-core AI chip, 25kW WSE-3, 125 ...
[18] Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co ... https://www.cerebras.ai/blog/cerebr...-inside-the-hw-sw-co-design-for-deep-learning
[19] Cerebras breaks the reticle limit with wafer-scale computing - LinkedIn https://www.linkedin.com/posts/andr...ting-to-see-activity-7392923857269153792-hI2T
 
Awesome work, @KevinK. I'll need to pore over these two posts for a while. I also need to investigate what Cerebras does in software, and where that software runs. It feels like something is missing.

One caveat that might be relevant here is a caution I sent in email to a friend using ChatGPT a few months ago to analyze some system architecture specs. "AI always appears to be most effective when the answer it generates agrees with what you thought was the answer before you asked the question. Don't fall for the ego stroking. You both may be incorrect."

Another article referencing Cerebras's claim to linear multi-WSE scaling. I'm still keeping an open mind, but a skeptical one.

 
Anybody who has predicted Ethernet's death must have their nose in Jack Daniel's.

Much more like Kavalan than JD.

TW semi community is very bullish on established technologies in general because razed fields open new opportunities for growth.

TW fabless have long been pushed out of high end networking so they bet on interesting new entrants.
 
Awesome work, @KevinK. I'll need to pore over these two posts for a while. I also need to investigate what Cerebras does in software, and where that software runs. It feels like something is missing.

One caveat that might be relevant here is a caution I sent in email to a friend using ChatGPT a few months ago to analyze some system architecture specs. "AI always appears to be most effective when the answer it generates agrees with what you thought was the answer before you asked the question. Don't fall for the ego stroking. You both may be incorrect."

Another article referencing Cerebras's claim to linear multi-WSE scaling. I'm still keeping an open mind, but a skeptical one.

Great analysis, far better than I could have done... :-)

I agree that multi-WSE scaling being close to linear with the number of WSEs (with a carefully chosen benchmark, perhaps?) is great, but it ignores the big speed impact of going off-wafer in the first place, which is what I was talking about. A single Blackwell has around 5x the HBM capacity compared to the Cerebras SRAM, and I don't know how many Blackwells have similar NPU power to one WSE-3, but guessing from power consumption it's maybe 20? Which means that for the same NPU-cluster processing power, Blackwell has ~100x more "local memory" (HBM) than WSE-3 (SRAM) -- but of course it's also ~100x slower.
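Writing that arithmetic out explicitly (the 192 GB and 44 GB figures are the per-device numbers quoted above; the 20-Blackwell compute equivalence is only my guess from power consumption, not a measurement):

```python
# The arithmetic behind the ~100x "local memory" gap claimed above. The 192 GB
# and 44 GB figures are the per-device numbers from the thread; the 20-Blackwell
# compute equivalence is a power-consumption-based guess, not a measurement.

BLACKWELL_HBM_GB = 192
WSE3_SRAM_GB = 44
BLACKWELLS_PER_WSE3 = 20      # assumed compute-equivalent count

per_device_ratio = BLACKWELL_HBM_GB / WSE3_SRAM_GB
iso_compute_ratio = per_device_ratio * BLACKWELLS_PER_WSE3

print(f"per-device HBM/SRAM capacity ratio : ~{per_device_ratio:.1f}x")
print(f"iso-compute local-memory ratio     : ~{iso_compute_ratio:.0f}x")
```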

Regardless of how this memory is used -- which it seems is very different for the two architectures -- this says to me that there are going to be two model-size-dependent "cliff edges" where performance suddenly drops massively: one where WSE-3 runs out of SRAM and one where Blackwell runs out of HBM. Below the first one WSE-3 will be faster, between the two Blackwell will be faster, and above the second one it'll be a wash. I'm sure the differences in how memory is used can close this gap by hiding memory latency/BW, but a 100x size difference is an *awful* lot to make up; even reducing the negative effect by 10x (ambitious!) still leaves a 10x gap.

On top of this, the memory size gap can only increase in future, because the SRAM on WSE-3 has pretty much stopped scaling with process node, with almost no shrink for the last few years (down to N2) and no real prospect of this changing. In contrast, HBM is still scaling, in cell density and in the number of layers per stack, plus the ability to have more stacks or a bigger HBM die, plus the move to hybrid bonding between layers instead of micropillars -- put these together and there's a relatively easy path to perhaps ~10x more HBM per NPU in the next 5 years or so, which pushes the memory size gap up from ~100x to ~1000x.

Even ignoring cost -- which I'm pretty sure is the elephant in the room for Cerebras! -- this suggests that the slice of the current AI market where they can win is quite small today, and will only shrink as model sizes increase in future.

OTOH that's a small slice of an enormous and rapidly growing AI market, which doesn't mean they won't be successful in their niche -- just that the rest of the AI market will be *more* successful... ;-)
 

I think I finally found what I'm looking for regarding how Cerebras does scale-out in processing flow terms.


Basically, SwarmX and MemoryX turn a scale-out CS-3 configuration into a giant partitioned data-parallel processor, with each WSE-3 having a private dataset to perform tensor and matrix operations on, and then only the weights and gradients are communicated to MemoryX for hierarchical storage in DRAM and NAND. The KV-sharing discussion from the AI overview is irrelevant to the Cerebras scale-out strategy. The mistake the AI is making is judging the Cerebras processing flow against the GPU processing flow, which has a completely different parallelization and memory-management model. (I'd probably reassign a human architect who made a mistake like that to a less big-picture function. No joke.)

SwarmX also uses a tree interconnect topology to enable low-overhead multi-node broadcasts (needed for the coordination of partitioned data-parallel operations) and reduction functions. This isn't the first instance I've seen of an interconnect that provides active processing functions in the fabric -- collective operations in supercomputing interconnects are another example, as was using a tree-topology interconnect to do sorting operations in a database machine -- but outside of recent collective operations it's the first example I've seen in a long time.
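As a minimal mental model of that flow (purely schematic; the shapes, the least-squares objective, and the shard count are invented for illustration): weights are broadcast to every data-parallel worker, each worker computes gradients on its private shard, and the gradients are reduced back at a central store.

```python
import numpy as np

# Schematic of the data-parallel flow described above: broadcast weights to
# every worker, each worker computes gradients on its private data shard, and
# gradients are averaged (reduced) back at a central parameter store. Shapes
# and the plain-SGD update are invented for illustration.

rng = np.random.default_rng(2)
D, SHARD_SIZE, N_WORKERS, LR = 8, 16, 4, 0.1

weights = rng.standard_normal(D)                      # lives in the central store
shards = [(rng.standard_normal((SHARD_SIZE, D)),      # each worker's private inputs
           rng.standard_normal(SHARD_SIZE))           # and targets
          for _ in range(N_WORKERS)]

def local_gradient(w, x, y):
    """Least-squares gradient on one worker's private shard."""
    err = x @ w - y
    return x.T @ err / len(y)

for step in range(3):
    # "broadcast": every worker sees the same current weights
    grads = [local_gradient(weights, x, y) for (x, y) in shards]
    # "reduce": average gradients back at the parameter store
    weights -= LR * np.mean(grads, axis=0)

print("updated weights:", np.round(weights, 3))
```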

So now I think I get it. There isn't a "cliff" in CS-3 scale-out efficiency because the processing assigned to an individual WSE-3 is still almost completely local to the SRAM and processors on the chip. The sending of results (weights and gradients) to the MemoryX units appears to be asynchronous, and tiny by comparison to the local processing. The slowdown component in the CS-3 scale-out strategy appears to be the results-aggregation phase, which is not enumerated, but must also be a tiny fraction of the total processing.

Long ago we used to talk about message-passing scale-out (often oddly called "shared nothing" architectures) and shared-storage scale-out. Because the throughput improvement in networks over the years has exceeded the improvement in per-processor computing capability, most scale-out storage, database, and computing systems now use shared-storage and shared load-store memory architectures rather than pure message-passing systems. However, it appears the Cerebras architects have breathed new life into a greater degree of localized processing with very limited shared data.

In old "shared nothing" systems an efficiency "cliff" was assumed as processing went off-node, to network I/O rather than memory, and estimated to be about 50% for the first node (as compared to only local processing), but then you could achieve a very high scaling factor as you added message passing nodes, depending on the problem space as much as 95% of the 50% adder per node. This assumption was due to the perceived high level of message passing overhead required for the problems of the time, while current AI processing appears to be more easily localizable with much lower multi-node overhead in the Cerebras architecture.
 
I do understand the big differences in architecture and how Cerebras claim to avoid scaling problems with increasingly large AI models -- which sounds suspiciously close to magic, but maybe it really does what it says on the tin.

Which raises the obvious question -- if their way of solving the problem is so much faster and more efficient and more scalable (blindingly obvious advantages, I'd have thought!) than the way everyone else is doing it, why aren't they taking over the AI world as fast as they can get kit out of the door?

Hence my suspicion that they're choosing benchmark cases which show off where they have a big advantage but carefully avoiding ones where they don't (or have a disadvantage). Yes it's what lots of companies do when trying to sell their technology, especially if it's radically different to the norm -- why would they do anything else, that would be crazy?

But it does also mean that it's possible that the emperor really *isn't* wearing anything when he gets out into the real world... ;-)

(or he is, but nobody can afford his solid gold underwear...)

Are there any realistic benchmarks (from a source who doesn't have an axe to grind...) which compare Cerebras performance with "conventional" AI (meaning Nvidia, I guess...) when solving the real problems which are dominating the AI space today? And if so, how this comparison is projected to change in the future with escalating AI model size?
 
I do understand the big differences in architecture and how Cerebras claim to avoid scaling problems with increasingly large AI models -- which sounds suspiciously close to magic, but maybe it really does what it says on the tin.
It doesn't sound like magic to me, just basic computer engineering.
Which raises the obvious question -- if their way of solving the problem is so much faster and more efficient and more scalable (blindingly obvious advantages, I'd have thought!) than the way everyone else is doing it, why aren't they taking over the AI world as fast as they can get kit out of the door?
The other companies are either using a recycled architecture (GPUs), more specialized systolic arrays on a traditional accelerator design (TPUs, Trainium), or reconfigurable computing designs (SambaNova and NextSilicon), which don't have similar parallelism capabilities and are at a smaller scale than the WSEs.

Why hasn't Cerebras taken over the world? IMO, the biggest deficiency they have is a simplistic software story (including development tools) compared to Nvidia and AMD. Then there's far less granularity for building large systems than with GPUs (even Blackwell) or ASICs, the exotic and expensive packaging and cooling you've described, and what amounts to a very expensive bet on a start-up with very little revenue and a distinct possibility of getting acquired by a hyperscaler, which might end the hardware business altogether in favor of cloud computing only. Talk about a recipe for an expensive wager on a relatively unproven horse.
Hence my suspicion that they're choosing benchmark cases which show off where they have a big advantage but carefully avoiding ones where they don't (or have a disadvantage).
There's no evidence to back up your accusation.
Yes it's what lots of companies do when trying to sell their technology, especially if it's radically different to the norm -- why would they do anything else, that would be crazy?

But it does also mean that it's possible that the emperor really *isn't* wearing anything when he gets out into the real world... ;-)
Not according to the benchmarks.
(or he is, but nobody can afford his solid gold underwear...)
For the time being, I think that is a factor.
Are there any realistic benchmarks (from a source who doesn't have an axe to grind...) which compare Cerebras performance with "conventional" AI (meaning Nvidia, I guess...) when solving the real problems which are dominating the AI space today? And if so, how this comparison is projected to change in the future with escalating AI model size?
Search the SemiAnalysis site for examples.
 
The mistake the AI is making is judging the Cerebras processing flow against the GPU processing flow, which has a completely different parallelization and memory-management model. (I'd probably reassign a human architect who made a mistake like that to a less big-picture function. No joke.)

I agree with you on this - they have their own approach to MoE that fits their hardware. But I asked the second question because there are known computational advantages to having a large shared KV cache across many users and queries, especially in applications that have large input contexts. It's unclear how much of a benefit this is from a cost or energy perspective with typical AI workloads. That's why I'm eager to see Cerebras system-level results (if possible) from InferenceMax (or now InferenceX).

 
Are there any realistic benchmarks (from a source who doesn't have an axe to grind...) which compare Cerebras performance with "conventional" AI (meaning Nvidia, I guess...) when solving the real problems which are dominating the AI space today? And if so, how this comparison is projected to change in the future with escalating AI model size?

Two thoughts on this -

First off, it's not guaranteed that models will keep growing in ACTIVE size. MoE actually substantially shrank the number of weights (memory) needed at any single transformer stage/layer, in favor of dynamic, conditional (per-MoE-routing) streaming of weights. This favors hardware that can dynamically accumulate and reuse a large KV context, based on what I understand.
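A quick illustration of why the active size can stay roughly flat while the total size grows (the layer shape and expert counts below are hypothetical, not any particular model):

```python
# Why active model size can stay roughly flat while total size grows: in an MoE
# layer only the top-k experts run per token. The shape below is hypothetical,
# not any particular model.

D_MODEL, D_FF = 4096, 14336
N_LAYERS = 32
N_EXPERTS, TOP_K = 64, 2

ffn_params_per_expert = 3 * D_MODEL * D_FF      # gate/up/down projections
total_ffn = N_LAYERS * N_EXPERTS * ffn_params_per_expert
active_ffn = N_LAYERS * TOP_K * ffn_params_per_expert

print(f"total FFN params : {total_ffn / 1e9:.0f} B")
print(f"active per token : {active_ffn / 1e9:.1f} B  ({TOP_K}/{N_EXPERTS} experts)")
```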

Second, to keep "spamming" this, but the best system-level AI inference benchmark I know of today is InferenceMax/InferenceX. Artificial Analysis is allegedly working on similar simulated many-user inference workload benchmarks using open models. But who knows when those will be available or when Cerebras will be represented.
 