OK @blueone and @IanD, I cannot claim this as my own work, but I did carefully target the questions and pored over the result to make sure it made sense; it does involve some extrapolation and inference beyond what's actually documented.
Cerebras handles MoE and KV-heavy transformers mainly by leaning on its weight‑streaming architecture (MemoryX + SwarmX + WSE) rather than the GPU‑style prefill/decode disaggregated serving stacks you see with vLLM/Dynamo/Neuron.[1][2][3]
## MoE on Cerebras
- Cerebras treats MoE as “just another” transformer variant: experts live in the FFN slots, and the compiler maps expert FFNs across the wafer’s cores while the router and gating sit in the same graph as a dense model (see the sketch after this list).[4][5][6]
- Because parameters are streamed from MemoryX rather than stored on device, you can scale total MoE parameter count (experts × depth) without being bound by on‑wafer SRAM, similar to how they handle multi‑trillion‑parameter dense models.[2][3][1]
- SwarmX + MemoryX keep the system strictly data parallel; a batch of tokens is sharded across CS‑3s, and MoE routing decisions are local to each shard, so you don’t need the custom “expert parallel” routing fabrics used on GPUs.[3][4]
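To make the “same graph as dense models” point concrete, here is a minimal numpy sketch of a top‑k MoE FFN block in which the router, gating, and expert FFNs are just ordinary ops in one forward graph, and routing only ever touches the tokens local to the current shard. Everything here (shapes, ReLU experts, function names) is my own illustration, not Cerebras code.

```python
import numpy as np

def top_k_moe_ffn(x, w_router, experts, k=2):
    """Toy MoE FFN block: route each token to its top-k expert FFNs.

    x        : (tokens, d_model) activations local to this data-parallel shard
    w_router : (d_model, n_experts) gating weights, ordinary ops in the same graph
    experts  : list of (w_in, w_out) weight pairs, one FFN per expert slot
    """
    logits = x @ w_router                               # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)               # softmax gate
    topk = np.argsort(-probs, axis=-1)[:, :k]           # chosen experts per token

    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        mask = (topk == e).any(axis=-1)                 # tokens routed to expert e (local only)
        if not mask.any():
            continue
        h = np.maximum(x[mask] @ w_in, 0.0)             # expert FFN (ReLU for simplicity)
        out[mask] += (h @ w_out) * probs[mask, e:e + 1] # weight output by gate probability
    return out

# toy usage: 8 tokens, d_model=16, 4 experts, top-2 routing
rng = np.random.default_rng(0)
d_model, n_experts, d_ff = 16, 4, 32
x = rng.normal(size=(8, d_model))
w_router = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]
print(top_k_moe_ffn(x, w_router, experts).shape)        # (8, 16)
```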
### What this means in practice
- Expert sparsity (top‑k experts per token) reduces *compute* per token, but Cerebras still streams weights layer by layer from MemoryX; the main benefit is that fewer expert FFNs need to be active on‑wafer per token at any one time.[3][4]
- The routing network and load‑balancing losses are handled in the Cerebras compiler graph; the debugging tools they ship for “dead experts” and load skew are built into their training/inference workflow rather than into a separate serving layer (a generic version of that diagnostic is sketched below).[7][4]
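For reference, this is the kind of load‑skew check those posts describe, expressed as a generic Switch‑Transformer‑style auxiliary loss plus a dead‑expert test. It is a toy sketch, not Cerebras' actual tooling; the function name and shapes are assumptions.

```python
import numpy as np

def load_balance_stats(router_probs, topk_idx, n_experts):
    """Generic MoE load-skew diagnostics (illustrative; not Cerebras' actual tooling).

    router_probs : (tokens, n_experts) softmax gate probabilities
    topk_idx     : (tokens, k) chosen expert indices per token
    Returns a Switch-Transformer-style auxiliary loss and the list of "dead"
    experts that received no tokens in this batch.
    """
    dispatch = np.bincount(topk_idx.ravel(), minlength=n_experts) / topk_idx.size
    importance = router_probs.mean(axis=0)               # mean gate prob per expert
    aux_loss = n_experts * float(np.dot(dispatch, importance))
    dead = [e for e in range(n_experts) if dispatch[e] == 0.0]
    return aux_loss, dead

# toy usage with random routing decisions
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(4), size=64)               # 64 tokens, 4 experts
topk = np.argsort(-probs, axis=-1)[:, :2]                # top-2 routing
print(load_balance_stats(probs, topk, n_experts=4))
```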
## Disaggregation vs GPUs
- Cerebras already **disaggregates parameters from compute**: MemoryX is effectively a big parameter server, WSE is the compute plane, and SwarmX is the broadcast/reduce fabric, so model storage is physically separate from compute nodes.[8][2][3]
- However, they do *not* publicly describe GPU‑style **prefill/decode disaggregation** where prefill and decode are run on different workers and KV cache is shipped over the network, as in Neuron “disaggregated inference” or Dynamo‑like designs.[9][10][11]
- Instead, the prefill and decode phases of autoregressive generation both execute on the WSE that owns the active sequence, using the same weight‑streaming machinery; the system’s disaggregation boundary is “parameters vs compute,” not “prefill vs decode” (a toy picture of that boundary follows).[2][3]
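Here is a toy picture of that “parameters vs compute” boundary: weights are pulled layer by layer from an external store, while prefill, decode, and the KV cache all stay on the same device. `ParamStore`, the single‑head attention, and all shapes are invented for illustration and are not the Cerebras API.

```python
import numpy as np

class ParamStore:
    """Stand-in for an external parameter store (a MemoryX analogue; name invented here)."""
    def __init__(self, layers):
        self.layers = layers                      # layer_id -> dict of weight arrays

    def stream(self, layer_id):
        # in a real system this is a fabric/network transfer; here it's just a lookup
        return self.layers[layer_id]

def attention_step(x, w, kv_cache, layer_id):
    """Toy single-head attention; K/V live in local memory and never leave the device."""
    q, k, v = x @ w["wq"], x @ w["wk"], x @ w["wv"]
    cache = kv_cache.setdefault(layer_id, {"k": [], "v": []})
    cache["k"].append(k)
    cache["v"].append(v)
    K, V = np.concatenate(cache["k"]), np.concatenate(cache["v"])
    scores = q @ K.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ V

def generate(store, prompt, n_layers, decode_steps=3):
    """Prefill and decode both run on the same 'device'; only weights are streamed in."""
    kv_cache = {}                                 # stays local for the life of the sequence
    x = prompt                                    # prefill: all prompt tokens at once
    for _ in range(decode_steps + 1):
        for layer in range(n_layers):
            w = store.stream(layer)               # weights arrive layer by layer
            x = attention_step(x, w, kv_cache, layer)
        # real decode would embed the newly sampled token; this toy just
        # reuses the last position's hidden state to keep the loop short
        x = x[-1:]
    return x

rng = np.random.default_rng(2)
d, n_layers = 8, 2
store = ParamStore({i: {name: 0.1 * rng.normal(size=(d, d)) for name in ("wq", "wk", "wv")}
                    for i in range(n_layers)})
print(generate(store, prompt=rng.normal(size=(4, d)), n_layers=n_layers).shape)   # (1, 8)
```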
### Comparing disaggregation styles
| Aspect | Cerebras WSE + MemoryX/SwarmX | GPU disaggregated serving (Neuron, etc.) |
|--------------------------------|-------------------------------------------|-----------------------------------------------------|
| Main disaggregation boundary | Parameters vs compute | Prefill vs decode workers |
| Where KV cache lives | On WSE SRAM per sequence | On decode workers’ GPU memory |
| Cross‑node traffic focus | Streaming weights, gradient reduce | Shipping KV cache between prefill/decode workers |
| Parallelism model | Strict data parallel over replicas | Data + tensor + expert + PD disaggregation |
| MoE scaling focus | Streaming huge expert sets from MemoryX | Routing tokens across expert GPUs + KV movement |
[11][4][8][2][3]
## KV cache handling
- KV cache in a Cerebras transformer still lives in the accelerator’s on‑wafer SRAM during inference; there’s no public description of offloading KV to MemoryX or doing cross‑node KV shipping the way PD disaggregation frameworks do.[11][2][3]
- Instead, the architecture reduces KV pressure by attacking *attention itself*: Cerebras has shown sparse‑attention work that roughly halves KV memory by making most layers sparse and keeping only a minority of dense layers in a modified Llama‑style decoder (rough arithmetic after this list).[12][13]
- Since compute is extremely abundant on the wafer and the memory feeding it is on‑wafer SRAM, Cerebras can also afford schemes where recomputation or sparse attention patterns trade a bit of extra math for a much lower KV footprint.[13][2]
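Rough back‑of‑the‑envelope arithmetic for why a mostly‑sparse attention stack lands near a 2× KV saving. The model shape, the 8/24 dense/sparse layer split, and the fraction of positions a sparse layer retains are all assumed for illustration; they are not the numbers from the Cerebras post.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * dtype."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# illustrative Llama-style shape (assumed; not the configuration from the Cerebras post)
n_kv_heads, head_dim, seq_len = 8, 128, 32_768
dense_layers, sparse_layers = 8, 24     # minority dense, majority sparse (assumed split)
keep = 0.375                            # fraction of positions a sparse layer retains (assumed)

all_dense = kv_cache_bytes(dense_layers + sparse_layers, n_kv_heads, head_dim, seq_len)
mixed = (kv_cache_bytes(dense_layers, n_kv_heads, head_dim, seq_len)
         + kv_cache_bytes(sparse_layers, n_kv_heads, head_dim, int(keep * seq_len)))

print(f"all-dense KV: {all_dense / 2**30:.2f} GiB")
print(f"mixed KV    : {mixed / 2**30:.2f} GiB ({mixed / all_dense:.0%} of dense)")
```

With these assumed numbers the mixed stack comes out around half the all‑dense KV footprint, which is the right order of magnitude for the claim in the Cerebras post even though their exact configuration differs.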
### Disaggregated transformer implications
- For “disaggregated transformer” ideas (huge off‑chip weights, on‑chip activations/KV), Cerebras is already there: weight streaming makes the transformer effectively disaggregated at the parameter level, but **KV and activations are intentionally kept local** to avoid network latency in the inner loop.[1][2][3]
- If you imagine a DeepSeek‑style stack on Cerebras, the analogue would likely be: experts and dense layers all live in MemoryX, the WSE runs prefill + decode on‑wafer with sparse attention to compress KV, and scale‑out is purely via data‑parallel replicas rather than explicit PD disaggregation and KV shipping (a trivial scale‑out sketch follows).[14][13][3]
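A deliberately trivial sketch of that scale‑out story: each data‑parallel replica owns its sequences end to end (prefill, decode, KV), and “routing” is just picking a replica, with no KV ever crossing the network. The names and structure are mine, not Cerebras software.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    """One data-parallel inference replica (a CS-3 analogue in this toy; name is mine)."""
    replica_id: int
    sequences: dict = field(default_factory=dict)    # seq_id -> local KV/activation state

    def serve(self, seq_id, prompt):
        # prefill and every decode step for this sequence happen here;
        # the KV cache is created and consumed locally, never shipped elsewhere
        self.sequences[seq_id] = {"kv_tokens": len(prompt)}
        return f"replica {self.replica_id} owns seq {seq_id}"

def route(seq_id, replicas):
    """Scale-out is plain data parallelism: pick a replica per sequence and keep it there."""
    return replicas[hash(seq_id) % len(replicas)]

replicas = [Replica(i) for i in range(4)]
for seq in ("req-a", "req-b", "req-c"):
    print(route(seq, replicas).serve(seq, prompt=[1, 2, 3]))
```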
## Sources
[1] tell me about weight streaming on cerebras
https://www.perplexity.ai/search/6d16c517-a056-4ba7-a335-56db3047a1dd
[2] Weight Streaming Execution - Cerebras AI
https://training-docs.cerebras.ai/rel-2.5.0/concepts/weight-streaming-execution
[3] Linear Scaling Made Possible with Weight Streaming - Cerebras
https://www.cerebras.ai/blog/linear-scaling-made-possible-with-weight-streaming
[4] MoE at Scale: Making Sparse Models Fast on Real Hardware
https://www.cerebras.ai/blog/moe-guide-scale
[5] MoE Fundamentals: Why Sparse Models Are the Future of AI
https://www.cerebras.ai/blog/moe-guide-why-moe
[6] [PDF] MoEfication: Transformer Feed-forward Layers are Mixtures of Experts
https://aclanthology.org/2022.findings-acl.71.pdf
[7] Debugging Dead MoE Models: A Step-by-Step Guide - Cerebras
https://www.cerebras.ai/blog/moe-guide-debug
[8] what does scale-up and scale-out look like for cerebras?
https://www.perplexity.ai/search/10cd6ef1-7c82-45c8-bcec-1bf89dce3758
[9] Tell me about NVIDIA Dynamo and multi-head attention with disaggregation
https://www.perplexity.ai/search/9ea90ed3-2196-4d89-aff0-c9819d1b8937
[10] Disaggregated inference
https://docs.modular.com/mammoth/disaggregated-inference/
[11] Disaggregated Inference [BETA] — AWS Neuron Documentation
https://awsdocs-neuron.readthedocs-...developer_guides/disaggregated-inference.html
[12] Mixture-of-Experts (MoE) LLMs - by Cameron R. Wolfe, Ph.D.
https://cameronrwolfe.substack.com/p/moe-llms
[13] Compressing KV cache memory by half with sparse attention
https://www.cerebras.ai/blog/compressing-kv-cache-memory-by-half-with-sparse-attention
[14] please provide more details about disaggregation, KV cache, multi-token prediction and communication optimizations
https://www.perplexity.ai/search/e7adce4c-eccc-4b5a-9b4f-b244ed5fcf23
[15] MoE-Inference-Bench: Performance Evaluation of Mixture of Expert ...
https://arxiv.org/html/2508.17467v1
[16] Prefill-decode disaggregation | LLM Inference Handbook
https://bentoml.com/llm/inference-optimization/prefill-decode-disaggregation
[17] How to Build, Train & Debug MoE Models in 2025 - YouTube
[18] Disaggregated Prefill and Decoding Inference System for Large ...
https://arxiv.org/abs/2509.17542
[19] Learn Mixture of Experts (MoE) with our new series - LinkedIn
https://www.linkedin.com/posts/cerebras-systems_moe-activity-7353453824520404993-lWWg