Cerebras uses dedicated servers, called MemoryX servers, which connect to the WSE-3 nodes over the SwarmX fabric. A MemoryX configuration can include up to 1.2 PB of shared memory, built from DDR5 and Flash tiers. Each WSE-3 also carries 44 GB of on-wafer SRAM, which has far lower access latency than any HBM.
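A rough sketch of that hierarchy in code (capacities are the figures quoted above; the tier names and ordering are just illustrative):

```python
from dataclasses import dataclass

@dataclass
class MemoryTier:
    name: str
    capacity_bytes: int   # nominal capacity of this tier
    on_wafer: bool        # True if the memory sits next to the compute cores

GB = 1024**3
PB = 1024**5

# Capacities taken from the text above; treat them as nominal, not measured.
WSE3_SRAM = MemoryTier("WSE-3 on-wafer SRAM", 44 * GB, on_wafer=True)
MEMORYX   = MemoryTier("MemoryX (DDR5 + Flash tiers)", int(1.2 * PB), on_wafer=False)

# The ordering that matters for inference: weights resident in on-wafer SRAM
# avoid any off-wafer fetch per token; everything else must cross SwarmX.
HIERARCHY = [WSE3_SRAM, MEMORYX]
```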
The simple case below is the 2024 picture - 2024 was all about fitting large dense models into memory. 2025 has gotten far more complicated: MoE models with bunches of smaller but "smarter" expert model-ettes activating per token and their results being combined again, prefill and decode separated onto different groups of processors, and a KV cache to build and manage. Not sure how they do the 2025-2026 edition.
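For a feel of why the KV cache alone becomes a planning problem, here is a back-of-the-envelope sizing sketch; all model shapes below are hypothetical placeholders, not Cerebras or any vendor's figures:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Size of the K and V tensors for one model across a batch of sequences."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token * seq_len * batch

# Example: a 70B-class dense model with grouped-query attention (hypothetical shapes).
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                      seq_len=8192, batch=32)
print(f"KV cache ≈ {size / 2**30:.1f} GiB")   # ≈ 80 GiB at these settings
```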
If the model fits on one wafer
• A CS‑3 has 44 GB of on‑chip SRAM across the wafer; for many production LLMs, all weights can be placed on‑wafer for inference.
• In that regime you don’t need weight streaming at runtime: parameters sit in SRAM next to the cores, so inference runs purely on‑wafer without repeatedly fetching weights from external memory.
• This is where Cerebras reports 10× faster LLM inference vs GPU clusters, driven by much higher effective memory bandwidth and no HBM/PCIe hops.
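A minimal sketch of the "fits on one wafer" check, assuming weights are the dominant on-wafer tenant and leaving a small margin for activations and runtime state (the 44 GB budget is the figure above; the overhead fraction is a guess):

```python
GB = 1024**3
WSE3_SRAM_BYTES = 44 * GB   # per-wafer SRAM budget quoted above

def fits_on_one_wafer(n_params: float, bytes_per_param: float,
                      overhead_frac: float = 0.10) -> bool:
    """True if the weights, plus a fudge factor for activations and
    runtime state, fit in one wafer's SRAM."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes * (1 + overhead_frac) <= WSE3_SRAM_BYTES

# Examples (hypothetical models, FP16 vs FP8 weights):
print(fits_on_one_wafer(8e9, 2.0))    # 8B  @ FP16 -> ~16 GB  -> True
print(fits_on_one_wafer(70e9, 2.0))   # 70B @ FP16 -> ~140 GB -> False
print(fits_on_one_wafer(70e9, 1.0))   # 70B @ FP8  -> ~70 GB  -> False
```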
If the model is larger than one wafer
• For very large models whose weights exceed 44 GB per wafer, Cerebras can reuse the same weight‑streaming mechanism as in training:
  • Weights reside in external MemoryX.
  • For each layer (or group of layers), weights are streamed onto the wafer; activations stay on‑wafer; results move forward layer‑by‑layer.
  • Latency is hidden the same way as in training: by overlapping weight transfers with compute and by exploiting coarse‑ and fine‑grained parallelism on the mesh.
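A minimal sketch of that pattern, with hypothetical `fetch_weights` and `run_layer` callables standing in for the MemoryX-to-wafer transfer and the on-wafer compute; the point is the one-layer prefetch that overlaps transfer with compute:

```python
from concurrent.futures import ThreadPoolExecutor

def stream_inference(layers, activations, fetch_weights, run_layer):
    """Run a model layer by layer, overlapping the fetch of layer i+1's
    weights with the compute of layer i (simple double buffering).

    fetch_weights(layer) -> weights          placeholder for MemoryX -> wafer transfer
    run_layer(layer, weights, acts) -> acts  placeholder for on-wafer compute
    """
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(fetch_weights, layers[0])
        for i, layer in enumerate(layers):
            weights = pending.result()               # wait for this layer's weights
            if i + 1 < len(layers):                  # start the next transfer early
                pending = prefetcher.submit(fetch_weights, layers[i + 1])
            activations = run_layer(layer, weights, activations)  # activations stay "on wafer"
    return activations
```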
Scaling out inference
• Multiple CS systems can serve inference together using SwarmX plus MemoryX, similar to training:
  • MemoryX holds one or more model copies; SwarmX broadcasts weights (if streaming is used) and aggregates any needed results.
  • For many inference workloads, the preferred pattern is replicated models across CS‑3s, each handling its own request stream, so most traffic stays local to each wafer and SwarmX is used more for management than for per‑token communication.
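A sketch of that replicated pattern: each CS-3 holds a full copy of the model and a simple front-end round-robins independent request streams across them, so the per-token path stays on one wafer. The class and method names here are illustrative, not any Cerebras API:

```python
import itertools

class CS3Replica:
    """Stand-in for one CS-3 system holding a full copy of the model."""
    def __init__(self, name: str):
        self.name = name

    def generate(self, prompt: str) -> str:
        # Placeholder: in reality this runs entirely on-wafer.
        return f"[{self.name}] completion for: {prompt!r}"

class ReplicatedFrontend:
    """Round-robin router: each request is served end-to-end by one replica,
    so SwarmX is not on the per-token path."""
    def __init__(self, replicas):
        self._next = itertools.cycle(replicas)

    def handle(self, prompt: str) -> str:
        return next(self._next).generate(prompt)

frontend = ReplicatedFrontend([CS3Replica(f"cs3-{i}") for i in range(4)])
for p in ["hello", "explain weight streaming", "summarize this"]:
    print(frontend.handle(p))
```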