
Cisco launched its Silicon One G300 AI networking chip in a move that aims to compete with Nvidia and Broadcom.

Can you point to one of these analyses?

I think you've already decided the answer you will believe, so it looks like asking this question here is not going to be productive. Do you believe GPUs, even with tensor cores and specialized interconnects like NVLink, are an optimal answer for AI training?
To first order these problems seem to be solved; however, solving them increases the cost of the system significantly, since special materials, techniques, and tools have had to be developed to address them.
The iso-space performance/watt numbers of CS-3-based systems are better than those of B200-based systems. However, the iso-space performance/watt/$ is much worse than that of B200-based systems, and it is evident from the above discussions that the higher cost of solving the problems associated with wafer-scale chips contributes to this.
This compares the WSE-3 with the B200, which is of course no longer state-of-the-art:

This means models up to ~20 billion parameters (with 16-bit weights) can fit entirely on-chip, enabling them to be run without ever touching off-chip memory. This is a huge advantage: each core gets single-cycle access to weights and activations.
When models exceed 44 GB, Cerebras uses a technique called Weight Streaming: the model parameters reside in external MemoryX cabinets (which can hold many TB), and the wafer streams in the needed weights layer by layer.
The catch is that when streaming from off-chip, performance depends on that external memory bandwidth and the sparsity/pattern of weight access.
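
As a rough illustration of the streaming pattern described above (this is not Cerebras' API; the weight_store dict standing in for MemoryX and the fits_on_chip helper are made-up names), a minimal Python sketch:

```python
import numpy as np

# Hypothetical sketch of the weight-streaming pattern described above. This is
# not Cerebras' actual API; the dict-based weight_store is a stand-in for MemoryX.

ON_CHIP_SRAM_BYTES = 44e9  # WSE-3 on-wafer SRAM capacity quoted above


def fits_on_chip(n_params: float, bytes_per_weight: int = 2) -> bool:
    """Rough fit check: parameter footprint vs. on-wafer SRAM."""
    return n_params * bytes_per_weight <= ON_CHIP_SRAM_BYTES


def forward_streaming(x, weight_store):
    """Stream each layer's weights in from the external store, layer by layer.

    The fetch is the bandwidth-bound step (external memory to wafer); the matmul
    is the compute step, done against activations that stay in fast local memory.
    """
    for layer_idx in sorted(weight_store):
        w = weight_store[layer_idx]          # "stream in" this layer's weights
        x = np.tanh(x @ w)                   # compute with local activations
    return x


# Toy 3-layer model whose weights live "off-chip" in a dict.
store = {i: np.random.randn(64, 64).astype(np.float32) for i in range(3)}
out = forward_streaming(np.random.randn(8, 64).astype(np.float32), store)
print(fits_on_chip(20e9), fits_on_chip(70e9), out.shape)   # True False (8, 64)
```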

There's no doubt that the Cerebras solution has fantastic performance so long as the model fits in the WSE memory, but it takes a hit when it doesn't -- and the problem I see here is the almost exponentially increasing size of AI models in cases where low-latency memory access is needed. With other solutions it's a lot easier to get a lot more HBM memory close to the NPUs, and this is then considerably faster than the Cerebras external memory for models which fit in HBM. Once they're too big even for this, the playing field levels out again; in fact Cerebras may have an advantage in having fewer levels of memory hierarchy.

But the other issue with Cerebras is cost -- so not performance but performance per dollar. Here the specialized custom hardware (and much lower volumes) puts it at a considerable disadvantage.

If neither of these is correct, why has Cerebras not taken over the entire AI world and wiped out the competition?

Your last question -- nope, absolutely not -- they do a good job, but for sure something more exactly tailored to the task could do it better. But then it would also have less flexibility, the classic custom ASIC problem... ;-)
 
Do you believe GPUs, even with tensor cores and specialized interconnects like NVLink, are an optimal answer for AI training?
My personal opinion is that as models evolve and improve, elements of both training and inference are becoming increasingly specialized, requiring tons of flexibility in interconnect, memory configuration/hierarchy, and malleability of heavy compute resources. With the current generation of MoE disaggregated transformer models, there isn't just a single stage/process for training or inference - there are numerous stages requiring different mixes of compute/memory/I/O resources. For instance, training is really a 3-stage process of pretraining → supervised fine‑tuning (SFT) → reinforcement learning from human feedback (RLHF) / preference‑based post‑training.

For a frontier MoE LLM, a typical split might look like this in terms of FLOPs (not counting data or human costs):
• Pretraining: ~80–95% of total FLOPs, even with MoE efficiency gains.
• SFT: a few percent (order 1–10% depending on scale and number of passes).
• RLHF / preference tuning (including reward model): also a few percent, sometimes similar to or somewhat larger than SFT, but still far below pretraining.
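
A back-of-envelope sketch of that split, using the usual ~6·N·T approximation for training FLOPs; the parameter and token counts below are illustrative assumptions, not figures for any real model:

```python
# Back-of-envelope sketch of the split above, using the standard ~6 * N * T
# approximation for training FLOPs (N = active parameters per token, T = tokens).
# The token counts are purely illustrative assumptions, not published figures;
# the "RLHF" line is meant as token-equivalents covering reward model + rollouts.

def train_flops(active_params: float, tokens: float) -> float:
    return 6.0 * active_params * tokens

active_params = 40e9                       # assumed active params/token for an MoE
stages = {
    "pretraining": train_flops(active_params, 15e12),   # ~15T tokens (assumed)
    "SFT":         train_flops(active_params, 0.5e12),  # ~0.5T tokens (assumed)
    "RLHF":        train_flops(active_params, 1.0e12),  # ~1T token-equivalents
}
total = sum(stages.values())
for name, flops in stages.items():
    print(f"{name:11s} {flops:.2e} FLOPs ({100 * flops / total:.0f}% of total)")
```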

MoE substantially reduces the FLOPs needed for pretraining vs dense model training for a given capability level, but it also introduces the need for sophisticated routing and load balancing. And if you're a Chinese model like DeepSeek, there might be a big "distilling someone else's model" step ;)
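
To make the routing/load-balancing point concrete, a toy top-k router in numpy (purely illustrative; real MoE routers add auxiliary balancing losses, capacity factors, and so on):

```python
import numpy as np

# Toy top-k router, only to illustrate the routing/load-balancing point above.
# Real MoE routers add auxiliary balancing losses, capacity factors, etc.

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, top_k = 1024, 64, 8, 2

x = rng.standard_normal((n_tokens, d_model))
w_gate = rng.standard_normal((d_model, n_experts))

logits = x @ w_gate
chosen = np.argsort(-logits, axis=1)[:, :top_k]     # top-k experts per token

# Load per expert: a random gate is roughly balanced, but trained gates tend to
# collapse onto favourite experts unless a balancing term pushes back.
load = np.bincount(chosen.ravel(), minlength=n_experts) / (n_tokens * top_k)
print("fraction of routed tokens per expert:", np.round(load, 3))

# The FLOPs saving vs a dense model comes from only top_k of n_experts expert
# FFNs being active per token.
print("active expert fraction per token:", top_k / n_experts)
```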

I do think that models are also somewhat evolving, both from a training and an inference perspective, to take best advantage of current and planned GPU hardware, except perhaps Gemini, which is better tuned for TPU systems. But that evolution means that any piece of highly dedicated hardware built for today's models would likely be DOA by the time the silicon made it into working systems, 3-4 years later.
 
To first order these problems seem to be solved; however, solving them increases the cost of the system significantly, since special materials, techniques, and tools have had to be developed to address them.
The iso-space performance/watt numbers of CS-3-based systems are better than those of B200-based systems. However, the iso-space performance/watt/$ is much worse than that of B200-based systems, and it is evident from the above discussions that the higher cost of solving the problems associated with wafer-scale chips contributes to this.
This compares the WSE-3 with the B200, which is of course no longer state-of-the-art:

This means models up to ~20 billion parameters (with 16-bit weights) can fit entirely on-chip, enabling them to be run without ever touching off-chip memory. This is a huge advantage: each core gets single-cycle access to weights and activations.
When models exceed 44 GB, Cerebras uses a technique called Weight Streaming: the model parameters reside in external MemoryX cabinets (which can hold many TB), and the wafer streams in the needed weights layer by layer.
The catch is that when streaming from off-chip, performance depends on that external memory bandwidth and the sparsity/pattern of weight access.

There's no doubt that the Cerebras solution has fantastic performance so long as the model fits in the WSE memory, but it takes a hit when it doesn't -- and the problem I see here is the almost exponentially increasing size of AI models in cases where low-latency memory access is needed. With other solutions it's a lot easier to get a lot more HBM memory close to the NPUs, and this is then considerably faster than the Cerebras external memory for models which fit in HBM. Once they're too big even for this, the playing field levels out again; in fact Cerebras may have an advantage in having fewer levels of memory hierarchy.

But the other issue with Cerebras is cost -- so not performance but performance per dollar. Here the specialized custom hardware (and much lower volumes) puts it at a considerable disadvantage.

If neither of these is correct, why has Cerebras not taken over the entire AI world and wiped out the competition?

Your last question -- nope, absolutely not -- they do a good job, but for sure something more exactly tailored to the task could do it better. But then it would also have less flexibility, the classic custom ASIC problem... ;-)
Your two references are weak.

Neither one of them contains any deep architecture discussion, implementation details, or comparisons, and neither has measurements. The first is written by a "Go To Market" company. ?? I do agree with the concerns about Cerebras packaging, cooling, and power requirements, which is why I have mentioned before that I think their immediate future lies in cloud-based access. There's no significant information in the article about MemoryX architecture or how it relates to training flows for very large models. I don't see how it supports your hypothesis at all.

The second paper, a Medium post actually, was better, but it still made me chuckle, perhaps more than the first. The author refers to the Google TPU as a "Broadcom ASIC", and groups it with the AWS Trainium? That's ridiculous. The post contains some nicely written high-level discussions about the processor architecture of the several products it covers, but the discussions are high level, and don't specifically support or refute the assertion you're making about Cerebras off-chip MemoryX access with analysis or data.

While this paper does not examine the performance of a Cerebras multi-node system, it is more of the calibre of paper I look for in product analysis:


Outside of a possible G42 system, I have serious doubts that there are any Cerebras MemoryX installations in customer sites. (I can't find any references to one.) Perhaps when one is installed and measured, we'll see some published results. Until then it is difficult to draw supported conclusions.
 
My personal opinion is that as models evolve and improve, elements of both training and inference are becoming increasingly specialized, requiring tons of flexibility in interconnect, memory configuration/hierarchy, and malleability of heavy compute resources. With the current generation of MoE disaggregated transformer models, there isn't just a single stage/process for training or inference - there are numerous stages requiring different mixes of compute/memory/I/O resources. For instance, training is really a 3-stage process of pretraining → supervised fine‑tuning (SFT) → reinforcement learning from human feedback (RLHF) / preference‑based post‑training.

For a frontier MoE LLM, a typical split might look like this in terms of FLOPs (not counting data or human costs):
• Pretraining: ~80–95% of total FLOPs, even with MoE efficiency gains.
• SFT: a few percent (order 1–10% depending on scale and number of passes).
• RLHF / preference tuning (including reward model): also a few percent, sometimes similar to or somewhat larger than SFT, but still far below pretraining.

MoE substantially reduces the FLOPs needed for pretraining vs dense model training for a given capability level, but it also introduces the need for sophisticated routing and load balancing. And if you're a Chinese model like DeepSeek, there might be a big "distilling someone else's model" step ;)

I do think that models are also somewhat evolving, both from a training and an inference perspective, to take best advantage of current and planned GPU hardware, except perhaps Gemini, which is better tuned for TPU systems. But that evolution means that any piece of highly dedicated hardware built for today's models would likely be DOA by the time the silicon made it into working systems, 3-4 years later.
agree the AI HW needs will be a lot more diversified, depending on the different stages mentioned here, and perhaps even on multiple different foundational models (not just all LLM-based). So in a way, whoever can master the process of putting out such customized silicon at a radically faster pace will win.
 
So in a way, whoever can master the process of putting out such customized silicon at a radically faster pace will win.
Two thoughts:
* agree that we will see new specialized foundational frontier models that have very different architectures from the current MoE transformers. But MoE transformers are proving to be far more generalizable than their original text-based forerunners. I’m seeing new multimodal variants (image, video) and co-optimization in the pipeline. My view is that GNNs will eventually enter the fray.
* I’m betting that software-controlled flexible hardware with small-granularity, heavily interconnected accelerators will evolve faster for co-optimization than dedicated hardware.
 
Just out of curiosity, because of the controversy over Cerebras CS-3 system clustering scalability, I began to poke around with some internet searches. One fascinating link that showed up was the programming-guide portion of the Cerebras website, dedicated to CS-3 components and how to build training applications specifically for the CS-3. Their documentation is surprisingly frank, especially for publicly available information, and deeply technical. Too technical for me in spots (the specifics of PyTorch programming, for example), and it forced me to educate myself further. For those with insatiable curiosity, or trouble getting to sleep, I recommend this section of the Cerebras website:


Unfortunately, I haven't found any cluster-system scaling measurements yet, but embedded in the text you'll see bravado about how near-linear the performance scaling is. Hmmm. I continue to be skeptical about how many clustered systems have been built, because the MemoryX and SwarmX nodes still appear to be based on networking technology that is multiple generations old (100 Gb/s Ethernet). Or, I suppose, it is possible that MemoryX operations are latency-sensitive rather than throughput-constrained, because the weight data is not large.
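
For scale, a quick sketch of what a single link at those line rates can actually move; the 2 GB per-layer weight size is just an assumption, and real installations presumably bond many links and overlap transfers with compute:

```python
# Quick arithmetic behind the 100 Gb/s Ethernet question above: how long it takes
# to move weight data over a single link at various line rates. The per-layer
# weight size is an illustrative assumption, and this ignores protocol overhead,
# multiple links, and any overlap with compute.

def transfer_ms(gigabytes: float, link_gbit_per_s: float) -> float:
    return gigabytes * 8.0 / link_gbit_per_s * 1e3

layer_weights_gb = 2.0                 # assumed size of one large layer's weights
for rate in (100, 400, 800):           # Gb/s
    print(f"{rate:3d} GbE: ~{transfer_ms(layer_weights_gb, rate):6.0f} ms "
          f"to stream {layer_weights_gb} GB")
```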
 
Just out of curiosity, because of the controversy over Cerebras CS-3 system clustering scalability, I began to poke around with some internet searches. One fascinating link that showed up was the programming-guide portion of the Cerebras website, dedicated to CS-3 components and how to build training applications specifically for the CS-3. Their documentation is surprisingly frank, especially for publicly available information, and deeply technical. Too technical for me in spots (the specifics of PyTorch programming, for example), and it forced me to educate myself further. For those with insatiable curiosity, or trouble getting to sleep, I recommend this section of the Cerebras website:


Unfortunately, I haven't found any cluster-system scaling measurements yet, but embedded in the text you'll see bravado about how near-linear the performance scaling is. Hmmm. I continue to be skeptical about how many clustered systems have been built, because the MemoryX and SwarmX nodes still appear to be based on networking technology that is multiple generations old (100 Gb/s Ethernet). Or, I suppose, it is possible that MemoryX operations are latency-sensitive rather than throughput-constrained, because the weight data is not large.
I don't think there's any doubt about Cerebras scalability, or the excellent performance when the problem fits "on-wafer", since there's a massive pool of NPUs and SRAM with ridiculous interconnect bandwidth and low latency.

The question is what happens when it *won't* fit on-wafer, specifically because for a given amount of NPU processing power there's a *lot* less close-cache memory than there is with the HBM stacks in conventional architectures -- I haven't crunched the numbers, but given SRAM-vs-HBM density/die area and HBM stack height I suspect the HBM is bigger by at least 10x, which means at least a 10x bigger model will fit. The HBM latency and bandwidth aren't as good as the Cerebras SRAM's -- and in either case, once you get past the capacity of SRAM/HBM you have to go to off-board memory, which is much slower and higher latency, the same for either system. You can try and hide the latency/bandwidth but this is only going to work sometimes.

That suggests to me that up to a 1x model size (which will fit into Cerebras on-wafer SRAM) Cerebras will have a considerable performance advantage, which is what the benchmarks show.

From this model size up to maybe 10x (or more? -- see below) a conventional architecture using HBM should be considerably faster than Cerebras which has to go off-board to fetch data.

Above 10x size both architectures should be similar, assuming similar mass memory storage and comms links to it -- what one can do, so can the other.
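
Putting those three regions into a trivial sketch, using the numbers above (44 GB on-wafer, roughly 10x that in HBM) purely as assumptions:

```python
# Sketch of the three regions above, using this post's own rough numbers as
# assumptions: ~44 GB of on-wafer SRAM, and "at least 10x" that much HBM close to
# the compute in a conventional system. Purely illustrative, not measured data.

SRAM_GB = 44.0
HBM_MULTIPLE = 10.0          # the post's guess; could be larger with newer HBM

def region(model_gb: float) -> str:
    if model_gb <= SRAM_GB:
        return "region 1: fits on-wafer -> Cerebras advantage"
    if model_gb <= HBM_MULTIPLE * SRAM_GB:
        return "region 2: fits in the HBM pool -> conventional advantage"
    return "region 3: off-board for both -> roughly level, cost decides"

for params_b in (20, 120, 700):                 # billions of parameters
    size_gb = params_b * 2                      # 16-bit weights
    print(f"{params_b}B params (~{size_gb} GB): {region(size_gb)}")
```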

So the question is -- where do the AI model sizes sit today compared to these 3 regions, and where will they go in future? It seems that the size is rapidly increasing, which you'd think means more and more will not sit in the Cerebras sweet spot any more but will move into the region where HBM wins. And it's *much* easier to add more HBM into a system (if you can get it, of course...) than it is to expand the Cerebras SRAM (which has stopped scaling!), so if the gap in local memory is 10x now (is it? or is it 20x with newer HBM?) it wouldn't be difficult to double this, or even more in future since HBM is still scaling but SRAM isn't.

==> Is there any genuine data out there which answers this question?

On top of that, even if Cerebras win in performance they have a distinct disadvantage in performance per dollar given the exotic system construction -- so maybe they win in cases where absolute performance matters more than cost (government-funded labs?), but only get a small fraction of the overall AI market which -- let's face it -- is going to be heavily cost-driven (including power consumption).

And once the model is too big to even fit in local HBM any more, performance should be a wash but Cerebras will be at a considerable cost disadvantage -- which means, dead in the water.

Am I missing something?
 
I don't think there's any doubt about Cerebras scalability, or the excellent performance when the problem fits "on-wafer", since there's a massive pool of NPUs and SRAM with ridiculous interconnect bandwidth and low latency.
I think you raise a bunch of valid questions about limits of performance vs model size and cost. Plus your earlier question was great as well - how big is the market for super-fast response time (low latency with high token rates) at high TCO? I think the real question is whether Cerebras can conform their architecture to support the system-level model/system optimizations happening in data-center-level work today. It's a positive that they are operating their own data centers, so there is cost/performance pressure to deliver directly (no middlemen). But we don't have SemiAnalysis-like benchmarks to give us greater insight yet.
 
I don't think there's any doubt about Cerebras scalability, or the excellent performance when the problem fits "on-wafer", since there's a massive pool of NPUs and SRAM with ridiculous interconnect bandwidth and low latency.

The question is what happens when it *won't* fit on-wafer, specifically because for a given amount of NPU processing power there's a *lot* less close-cache memory than there is with the HBM stacks in conventional architectures -- I haven't crunched the numbers, but given SRAM-vs-HBM density/die area and HBM stack height I suspect the HBM is bigger by at least 10x, which means at least a 10x bigger model will fit. The HBM latency and bandwidth aren't as good as the Cerebras SRAM's -- and in either case, once you get past the capacity of SRAM/HBM you have to go to off-board memory, which is much slower and higher latency, the same for either system. You can try and hide the latency/bandwidth but this is only going to work sometimes.
To make my position completely clear, since you seem confused about it, the only scalability I'm referring to in my responses is that of multi-WSE-3 configurations linked through SwarmX and MemoryX. Scalability within a single WSE-3 is a given in my mind.

Comparing Cerebras's memory hierarchy to Nvidia's and AMD's HBM usage seems very complex. For example, as you mentioned, the WSE-3 SRAM size is 44GB. Each Blackwell GPU has 192GB or 288GB of HBM, depending on whether it is the Ultra version or not. And that's just for one Blackwell. HBM, as you probably know, is high bandwidth, but has greater latency to the first byte of an access than DDR5. The WSE-3 SRAM, meanwhile, has a claimed latency of less than 1 ns, so probably about 1/100th the latency of HBM. The other difference is that Blackwell keeps several different types of data in HBM, including KV caches, intermediate results, and collectives. Cerebras insists SwarmX and MemoryX are only used for sharing weights between WSE-3s, which not only confuses me, but tells me the processing model for the WSE-3 is completely different than what Nvidia does in its GPUs. I haven't found sufficient architecture details on WSE-3 processing flows in a form I can comprehend, though there are detailed processing-flow descriptions in their 1.4.0 SDK, which I haven't studied yet.
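
One way to see how different the two memory systems are is a rough sweep-time comparison, using approximate public bandwidth figures (about 21 PB/s of aggregate on-wafer SRAM bandwidth for the WSE-3, about 8 TB/s of HBM3e per Blackwell); treat these as assumptions:

```python
# Rough "time to read every weight byte once" comparison, using approximate
# public bandwidth figures as assumptions: ~21 PB/s aggregate on-wafer SRAM
# bandwidth for the WSE-3, ~8 TB/s of HBM3e bandwidth per Blackwell GPU. This
# ignores compute, caching, and overlap entirely; it is only meant to show the
# orders of magnitude behind the latency/bandwidth contrast above.

devices = {
    "WSE-3 (44 GB SRAM)":       (44e9,  21e15),
    "Blackwell (192 GB HBM)":   (192e9, 8e12),
    "Blackwell Ultra (288 GB)": (288e9, 8e12),
}

for name, (capacity_bytes, bandwidth_bytes_per_s) in devices.items():
    sweep_us = capacity_bytes / bandwidth_bytes_per_s * 1e6
    print(f"{name}: ~{sweep_us:,.0f} microseconds per full sweep")
```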


That suggests to me that up to a 1x model size (which will fit into Cerebras on-wafer SRAM) Cerebras will have a considerable performance advantage, which is what the benchmarks show.

From this model size up to maybe 10x (or more? -- see below) a conventional architecture using HBM should be considerably faster than Cerebras which has to go off-board to fetch data.

Above 10x size both architectures should be similar, assuming similar mass memory storage and comms links to it -- what one can do, so can the other.

So the question is -- where do the AI model sizes sit today compared to these 3 regions, and where will they go in future? It seems that the size is rapidly increasing, which you'd think means more and more will not sit in the Cerebras sweet spot any more but will move into the region where HBM wins. And it's *much* easier to add more HBM into a system (if you can get it, of course...) than it is to expand the Cerebras SRAM (which has stopped scaling!), so if the gap in local memory is 10x now (is it? or is it 20x with newer HBM?) it wouldn't be difficult to double this, or even more in future since HBM is still scaling but SRAM isn't.
Given the fundamental differences in the usage models for the memory hierarchies of the WSE-3 and Blackwell, and the sharply contrasting technologies, I can't answer this question yet. There are only two possibilities: you have a fundamental misunderstanding about how the WSE-3 works in a multi-node configuration, or Cerebras is lying about inter-WSE-3 scaling efficiency. I don't see a middle ground.
==> Is there any genuine data out there which answers this question?
Not that I've found yet. I'm thinking about just asking Cerebras directly.
On top of that, even if Cerebras win in performance they have a distinct disadvantage in performance per dollar given the exotic system construction -- so maybe they win in cases where absolute performance matters more than cost (government-funded labs?), but only get a small fraction of the overall AI market which -- let's face it -- is going to be heavily cost-driven (including power consumption).

And once the model is too big to even fit in local HBM any more, performance should be a wash but Cerebras will be at a considerable cost disadvantage -- which means, dead in the water.

Am I missing something?
I think Cerebras does have a disadvantage in requiring exotic system construction, housing, and maintenance. (I like the term "exotic".) We also don't know what their systems really cost. Like, how much are Sandia Labs and G42 really paying for their WSE-3 hardware? Since Cerebras is private, we also don't know how much money they're losing.
 
OK @blueone and @IanD,
I cannot claim this as my own work, though I did carefully target the questions and pored over the result to make sure it made sense, even though it involved some extrapolation and inference from what's actually documented.

Cerebras handles MoE and KV-heavy transformers mainly by leaning on its weight‑streaming architecture (MemoryX + SwarmX + WSE) rather than the GPU‑style prefill/decode disaggregated serving stacks you see with vLLM/Dynamo/Neuron.[1][2][3]

## MoE on Cerebras

- Cerebras treats MoE as “just another” transformer variant: experts live in the FFN slots, and the compiler maps expert FFNs across the wafer’s cores while the router and gating are implemented in the same graph as dense models.[4][5][6]
- Because parameters are streamed from MemoryX rather than stored on device, you can scale total MoE parameter count (experts × depth) without being bound by on‑wafer SRAM, similar to how they handle multi‑trillion‑parameter dense models.[2][3][1]
- SwarmX + MemoryX keep the system in strict data parallel; a batch of tokens is sharded across CS‑3s, and the MoE routing decisions are local to each shard, so you don’t need custom “expert parallel” routing fabrics as on GPUs.[3][4]

### What this means in practice

- Expert sparsity (top‑k experts per token) reduces *compute* per token, but Cerebras still sees weights layer‑by‑layer via streaming; the main benefit is that fewer expert FFNs are instantiated per token on‑wafer at a time.[3][4]
- The routing network and load‑balancing losses are all handled in the Cerebras compiler graph; debugging tools they ship for “dead experts”/load skew are built into their training/inference workflow rather than a separate serving layer.[7][4]

## Disaggregation vs GPUs

- Cerebras already **disaggregates parameters from compute**: MemoryX is effectively a big parameter server, WSE is the compute plane, and SwarmX is the broadcast/reduce fabric, so model storage is physically separate from compute nodes.[8][2][3]
- However, they do *not* publicly describe GPU‑style **prefill/decode disaggregation** where prefill and decode are run on different workers and KV cache is shipped over the network, as in Neuron “disaggregated inference” or Dynamo‑like designs.[9][10][11]
- Instead, the prefill and decode phases of autoregressive generation both execute on the WSE that owns the active sequence, with the same weight‑streaming machinery; the system’s disaggregation boundary is “parameters vs compute,” not “prefill vs decode.”[2][3]

### Comparing disaggregation styles

| Aspect | Cerebras WSE + MemoryX/SwarmX | GPU disaggregated serving (Neuron, etc.) |
|--------------------------------|-------------------------------------------|-----------------------------------------------------|
| Main disaggregation boundary | Parameters vs compute | Prefill vs decode workers |
| Where KV cache lives | On WSE SRAM per sequence | On decode workers’ GPU memory |
| Cross‑node traffic focus | Streaming weights, gradient reduce | Shipping KV cache between prefill/decode workers |
| Parallelism model | Strict data parallel over replicas | Data + tensor + expert + PD disaggregation |
| MoE scaling focus | Streaming huge expert sets from MemoryX | Routing tokens across expert GPUs + KV movement |
[11][4][8][2][3]

## KV cache handling

- KV cache in a Cerebras transformer still lives in the accelerator’s on‑wafer SRAM during inference; there’s no public description of offloading KV to MemoryX or doing cross‑node KV shipping the way PD disaggregation frameworks do.[11][2][3]
- Instead, the architecture reduces KV pressure by attacking *attention itself*: they show work on sparse attention that halves KV memory by using mostly sparse patterns and only a minority of dense layers in a modified Llama‑style decoder.[12][13]
- Since compute is extremely abundant on the wafer and memory bandwidth is on‑wafer, Cerebras can also afford schemes where recomputation or sparse attention patterns trade a bit of extra math for much lower KV footprint.[13][2]
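
For a sense of scale, a back-of-envelope KV-cache calculation; the model shape is a Llama-style assumption, not a Cerebras figure. A single very long sequence can already approach the 44 GB of on-wafer SRAM, which is presumably why the sparse-attention work matters:

```python
# Back-of-envelope KV-cache sizing, to put "KV stays in on-wafer SRAM" in numbers.
# Standard formula: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.
# The model shape below is an illustrative Llama-style assumption, not any
# specific product configuration.

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Example: 80 layers, 8 KV heads (GQA), head_dim 128, 128k-token context, fp16.
per_seq = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=128_000)
print(f"KV cache per 128k-token sequence: ~{per_seq:.0f} GB")            # ~42 GB
print(f"with the ~2x sparse-attention saving cited above: ~{per_seq/2:.0f} GB")
```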

### Disaggregated transformer implications

- For “disaggregated transformer” ideas (huge off‑chip weights, on‑chip activations/KV) Cerebras is already there: weight streaming makes the transformer effectively disaggregated at the parameter level, but **KV and activations are intentionally kept local** to avoid network latency in the inner loop.[1][2][3]
- If you imagine a DeepSeek‑style stack on Cerebras, the analogue would likely be: experts and dense layers all live in MemoryX, WSE runs prefill+decode on‑wafer with sparse attention to compress KV, and scale‑out is purely via data parallel replicas rather than explicit PD disaggregation and KV shipping.[14][13][3]

Sources
[1] tell me about weight streaming on cerebras https://www.perplexity.ai/search/6d16c517-a056-4ba7-a335-56db3047a1dd
[2] Weight Streaming Execution - Cerebras AI https://training-docs.cerebras.ai/rel-2.5.0/concepts/weight-streaming-execution
[3] Linear Scaling Made Possible with Weight Streaming - Cerebras https://www.cerebras.ai/blog/linear-scaling-made-possible-with-weight-streaming
[4] MoE at Scale: Making Sparse Models Fast on Real Hardware https://www.cerebras.ai/blog/moe-guide-scale
[5] MoE Fundamentals: Why Sparse Models Are the Future of AI https://www.cerebras.ai/blog/moe-guide-why-moe
[6] [PDF] MoEfication: Transformer Feed-forward Layers are Mixtures of Experts https://aclanthology.org/2022.findings-acl.71.pdf
[7] Debugging Dead MoE Models: A Step-by-Step Guide - Cerebras https://www.cerebras.ai/blog/moe-guide-debug
[8] what does scale-up ans scale-up look like for cerebras ? https://www.perplexity.ai/search/10cd6ef1-7c82-45c8-bcec-1bf89dce3758
[9] Tell me about NVIDIA dynamo and multi-headed attention with diaggregation https://www.perplexity.ai/search/9ea90ed3-2196-4d89-aff0-c9819d1b8937
[10] Disaggregated inference https://docs.modular.com/mammoth/disaggregated-inference/
[11] Disaggregated Inference [BETA] — AWS Neuron Documentation https://awsdocs-neuron.readthedocs-...developer_guides/disaggregated-inference.html
[12] Mixture-of-Experts (MoE) LLMs - by Cameron R. Wolfe, Ph.D. https://cameronrwolfe.substack.com/p/moe-llms
[13] Compressing KV cache memory by half with sparse attention https://www.cerebras.ai/blog/compressing-kv-cache-memory-by-half-with-sparse-attention
[14] plese provided more details about disaggregation, KV cache, multi-token predication and communication optimizations https://www.perplexity.ai/search/e7adce4c-eccc-4b5a-9b4f-b244ed5fcf23
[15] MoE-Inference-Bench: Performance Evaluation of Mixture of Expert ... https://arxiv.org/html/2508.17467v1
[16] Prefill-decode disaggregation | LLM Inference Handbook https://bentoml.com/llm/inference-optimization/prefill-decode-disaggregation
[17] How to Build, Train & Debug MoE Models in 2025 - YouTube
[18] Disaggregated Prefill and Decoding Inference System for Large ... https://arxiv.org/abs/2509.17542
[19] Learn Mixture of Experts (MoE) with our new series - LinkedIn https://www.linkedin.com/posts/cerebras-systems_moe-activity-7353453824520404993-lWWg
 
@IanD and @blueone, the second bit.

Cerebras buys simplicity and bandwidth by *not* having a big, coherent, multi‑processor KV/memory hierarchy—but it also gives up several capabilities that GPU‑style, cache‑centric clusters are starting to exploit.[1][2][3]

### 1. No cross‑device KV sharing

Because KV cache is local to a single WSE and there is no documented, coherent KV fabric across CS‑3s, Cerebras cannot:
- Share a long‑context KV across many concurrent decoders the way disaggregated inference systems do (e.g., one prefill feeding many decode workers).[2][4][5]
- Reuse KV across requests *on different devices* for prefix caching or multi‑tenant fan‑out; reuse is essentially constrained to what fits on one wafer.[1][2]

On GPUs with a KV‑aware serving layer, you can run prefill once, then route many follow‑ups to different decode workers that all see a shared KV pool; Cerebras’ data‑parallel pods don’t expose that kind of KV‑coherence abstraction.[4][5][6]
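
A toy sketch of what that prefill/decode split means mechanically (all names are hypothetical; this is not any serving framework's API):

```python
import numpy as np

# Minimal toy sketch of prefill/decode disaggregation as described above: one
# worker builds the KV cache for the prompt, then the cache (not the prompt) is
# handed to a separate decode worker. All names are hypothetical; this is not
# any serving framework's API.

LAYERS, KV_HEADS, HEAD_DIM = 4, 2, 8

def prefill(prompt_tokens):
    """Pretend prefill worker: produce a KV cache covering the whole prompt."""
    t = len(prompt_tokens)
    return {"k": np.zeros((LAYERS, KV_HEADS, t, HEAD_DIM)),
            "v": np.zeros((LAYERS, KV_HEADS, t, HEAD_DIM))}

def decode_step(kv_cache, new_kv):
    """Pretend decode worker: append one generated token's K/V to the cache."""
    for key in ("k", "v"):
        kv_cache[key] = np.concatenate([kv_cache[key], new_kv[key]], axis=2)
    return kv_cache

# "Shipping" the cache is a network transfer in a real system; the point is that
# the decode side never re-runs prefill, it just keeps growing the received cache.
cache = prefill(prompt_tokens=list(range(1000)))
new = {"k": np.zeros((LAYERS, KV_HEADS, 1, HEAD_DIM)),
       "v": np.zeros((LAYERS, KV_HEADS, 1, HEAD_DIM))}
cache = decode_step(cache, new)
print(cache["k"].shape)   # (4, 2, 1001, 8)
```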

### 2. Limited global memory hierarchy tricks

Cerebras has a sharp split: on‑wafer SRAM for activations/KV, and off‑wafer MemoryX for weights; there isn’t a multi‑level, shared, device‑coherent cache hierarchy (HBM tiers, host DRAM tiers, NVMe tiers) tuned around KV the way large GPU systems are evolving.[3][7][8]

That means it misses out on:
- Fine‑grained KV offload/bring‑back policies across devices and tiers (HBM ↔ host ↔ NVMe) that let GPU stacks push sequence lengths or batch sizes beyond single‑device KV capacity.[6][9][2]
- Cross‑model or cross‑session KV/page caching: a GPU cluster can, in principle, treat KV like pages in a distributed cache and keep hot prefixes in a multi‑node hierarchy; WSE‑3 does not present that abstraction—it just has very fast local SRAM.[10][2][1]

### 3. Less flexibility for exotic KV‑heavy workloads

The wafer is fantastic when most of the action is “do a lot of math on data that fits in 44 GB SRAM,” but it is less ideal when the *primary* challenge is orchestrating huge KV graphs across many agents or tools. For example:[11][12][1]
- Multi‑agent, tool‑using workloads where many agents share a large, evolving world‑state KV over long horizons benefit from a global KV/memory fabric; Cerebras mostly assumes per‑wafer isolation with data‑parallel replication.[10][1]
- Workloads that want aggressive KV pooling across models (e.g., shared retrieval contexts, shared document windows for dozens of micro‑models) are easier to host on a cluster that treats KV as a first‑class distributed object.[2][6]

Cerebras partly compensates by attacking KV size directly (sparse attention, memory tokens, sensitivity‑based selection of dense vs sparse layers), cutting KV memory nearly in half for Llama‑style models. That’s very good for “one big sequence per wafer,” but not a substitute for a fabric that can move and share KV arbitrarily across many processors.[2]

### 4. Constraints on disaggregated prefill/decode

Because there’s no massive coherent KV hierarchy across CS‑3s, prefill and decode are not cleanly disaggregated across multiple wafers with KV shipping between them in the way Neuron/Dynamo‑style systems do.[5][3]

You lose:
- The ability to run prefill on a large, bursty pool and then hand off only KV to a small, latency‑tuned decode pool across the cluster.[5][6]
- Some of the elasticity and autoscaling patterns where KV is the unit of work moved around the fleet.

Instead, Cerebras relies on weight‑streaming disaggregation (parameters in MemoryX, compute+KV on wafer) and keeps KV local, which simplifies programming and gives terrific per‑device efficiency, but doesn’t expose the emergent “KV‑as‑a‑service” semantics that a massive coherent multi‑processor hierarchy could.[8][13][3]


Sources
[1] Cerebras Wafer-Scale Engine Overview - Emergent Mind https://www.emergentmind.com/topics/cerebras-wafer-scale-engine-wse
[2] Compressing KV cache memory by half with sparse attention https://www.cerebras.ai/blog/compressing-kv-cache-memory-by-half-with-sparse-attention
[3] A Comparison of the Cerebras Wafer-Scale Integration Technology ... https://arxiv.org/html/2503.11698v1
[4] Cerebras WSE-3: A New Frontier for AI Computation - LinkedIn https://www.linkedin.com/posts/yuga...ras-wse3-ai-activity-7359230594641268737-q9UV
[5] Disaggregated Inference [BETA] — AWS Neuron Documentation https://awsdocs-neuron.readthedocs-...developer_guides/disaggregated-inference.html
[6] Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on ... https://arxiv.org/html/2512.16473v1
[7] Cerebras Wafer-Scale Engine Overview - Emergent Mind https://www.emergentmind.com/topics/cerebras-wafer-scale-engine
[8] Weight Streaming Execution - Cerebras AI https://training-docs.cerebras.ai/rel-2.5.0/concepts/weight-streaming-execution
[9] Accelerating Mixture-of-Experts Inference by Hiding Offloading ... https://arxiv.org/html/2508.21706v1
[10] Right Systems for Agentic Workloads - Chipstrat https://www.chipstrat.com/p/right-systems-for-agentic-workloads
[11] With wafer scale chips becoming more popular, what's ... - Reddit
[12] Cerebras Wafer-Scale Engine: When to Choose Alternative AI ... https://introl.com/blog/cerebras-wafer-scale-engine-cs3-alternative-ai-architecture-guide-2025
[13] Cerebras' Wafer-Scale Architecture Solves AI Memory Bottleneck https://www.linkedin.com/posts/gaut...the-biggest-activity-7408383432768053248-hF5Z
[14] [PDF] WaferLLM: A Wafer-Scale LLM Inference System - arXiv https://arxiv.org/pdf/2502.04563.pdf
[15] 100x Defect Tolerance: How Cerebras Solved the Yield Problem https://www.cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem
[16] [PDF] Cerebras Wafer-Scale AI - Hot Chips 2024 - https://hc2024.hotchips.org/assets/program/conference/day2/72_HC2024.Cerebras.Sean.v03.final.pdf
[17] Cerebras CS-3 wafer-scale million-core AI chip, 25kW WSE-3, 125 ...
[18] Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co ... https://www.cerebras.ai/blog/cerebr...-inside-the-hw-sw-co-design-for-deep-learning
[19] Cerebras breaks the reticle limit with wafer-scale computing - LinkedIn https://www.linkedin.com/posts/andr...ting-to-see-activity-7392923857269153792-hI2T
 
Awesome work, @KevinK. I'll need to pore over these two posts for a while. I also need to investigate what Cerebras does in software, and where that software runs. It feels like something is missing.

One caveat that might be relevant here is a caution I sent in email to a friend using ChatGPT a few months ago to analyze some system architecture specs. "AI always appears to be most effective when the answer it generates agrees with what you thought was the answer before you asked the question. Don't fall for the ego stroking. You both may be incorrect."

Another article referencing Cerebras's claim to linear multi-WSE scaling. I'm still keeping an open mind, but a skeptical one.

 