Cisco launched its Silicon One G300 AI networking chip in a move that aims to compete with Nvidia and Broadcom.

Can you point to one of these analyses?

I think you've already decided the answer you will believe, so it looks like asking this question here is not going to be productive. Do you believe GPUs, even with tensor cores and specialized interconnects like NVLink, are an optimal answer for AI training?
At first order these problems seem to be solved; however, they increase the cost of the system significantly, since special materials, techniques, and tools have had to be developed to address them.
The iso-space performance/watt numbers of CS-3-based systems are better than those of B200-based systems. However, the iso-space performance/watt/$ is much worse than that of B200-based systems, and it is evident from the above discussion that the higher cost of solving the problems associated with wafer-scale chips contributes to this.
This is comparing the WSE-3 with the B200, which is of course no longer state-of-the-art:

This means models up to ~40 billion parameters (with 16-bit weights) can fit entirely on-chip, enabling them to be run without ever touching off-chip memory. This is a huge advantage: each core gets single-cycle access to weights and activations.
When models exceed 44 GB, Cerebras uses a technique called Weight Streaming — the model parameters reside in external MemoryX cabinets (which can be many TB), and the wafer streams in the needed weights each layer.
The catch is that when streaming from off-chip, performance depends on that external memory bandwidth and the sparsity/pattern of weight access.

There's no doubt that the Cerebras solution has fantastic performance so long as the models fit in the WSE memory, but it takes a hit when they don't -- and the problem I see here is the almost exponentially increasing size of AI models in cases where low-latency memory access is needed. With other solutions it's a lot easier to get a lot more HBM memory close to the NPUs, and this is then considerably faster than the Cerebras external memory for models which fit in HBM. Once they're too big even for this, the playing field levels out again; in fact Cerebras may have an advantage by having fewer levels of memory hierarchy.
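To put rough numbers on that three-tier picture, here's a quick back-of-the-envelope sketch in plain Python. The capacities are assumptions for illustration only (the 44 GB of on-wafer SRAM quoted above, a nominal 192 GB of HBM per GPU, eight GPUs per node), not vendor specs, and it only counts weights:

```python
# Rough classification of where a model's weights land in each architecture.
# Capacities are illustrative assumptions, not vendor specifications.
WSE_SRAM_GB = 44          # on-wafer SRAM figure quoted above for the WSE-3
HBM_PER_GPU_GB = 192      # a nominal Blackwell-class GPU (assumed)
GPUS_PER_NODE = 8         # assumed node size

def weight_footprint_gb(params_billion: float, bits_per_weight: int = 16) -> float:
    """Weight footprint in GB; ignores activations, KV cache and optimizer state."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def placement(params_billion: float, bits: int = 16) -> str:
    size = weight_footprint_gb(params_billion, bits)
    wafer = "fits in on-wafer SRAM" if size <= WSE_SRAM_GB else "needs MemoryX weight streaming"
    gpu = ("fits in one node's HBM pool" if size <= HBM_PER_GPU_GB * GPUS_PER_NODE
           else "needs multi-node sharding / offload on the GPU side")
    return f"{params_billion}B @ {bits}-bit = {size:.0f} GB: {wafer}; {gpu}"

for b in (8, 20, 70, 400):
    print(placement(b))
```

Activations, KV cache and optimizer state make the real numbers worse, but the tiering logic is the same.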

But the other issue with Cerebras is cost -- so not performance but performance per dollar. Here the specialized custom hardware (and much lower volumes) put it at a considerable disadvantage.

If neither of these is correct, why has Cerebras not taken over the entire AI world and wiped out the competition?

Your last question -- nope, absolutely not -- they do a good job, but for sure something more exactly tailored to the task could do it better. But then it would also have less flexibility, the classic custom ASIC problem... ;-)
 
Do you believe GPUs, even with tensor cores and specialized interconnects like NVLink, are an optimal answer for AI training?
My personal opinion is that as models evolve and improve, elements of both training and inference are becoming increasingly specialized, requiring tons of flexibility in interconnect, memory configuration/hierarchy and malleability of heavy compute resources. With the current generation of MoE disaggregated transformer models, there isn't just a single stage/process for training or inference - there are numerous stages requiring different mixes of compute/memory/io resources. For instance, training is really a 3-stage process of pretraining → Supervised fine‑tuning (SFT) → Reinforcement Learning with Human Feedback (RLHF) / preference‑based post‑training.

For a frontier MoE LLM, a typical split might look like this in terms of FLOPs (not counting data or human costs) -- a rough sketch of how such shares can be estimated follows the list:
• Pretraining: ~80–95% of total FLOPs, even with MoE efficiency gains.
• SFT: a few percent (order 1–10% depending on scale and number of passes).
• RLHF / preference tuning (including reward model): also a few percent, sometimes similar to or somewhat larger than SFT, but still far below pretraining.
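
As promised above, here is a rough sketch of where shares like these come from, using the common C ≈ 6·N·D training-compute approximation; every parameter and token count below is a hypothetical placeholder, not a figure for any real model:

```python
# Very rough training-FLOPs split using the standard C ~ 6 * N * D estimate,
# where N = active parameters per token and D = tokens processed.
# All counts below are hypothetical placeholders for illustration only.
def train_flops(active_params: float, tokens: float) -> float:
    return 6.0 * active_params * tokens

N_ACTIVE   = 40e9      # active params per token of a hypothetical MoE model
PRETRAIN_D = 10e12     # pretraining tokens (assumed)
SFT_D      = 0.2e12    # SFT tokens, often several epochs over a small set (assumed)
RLHF_D     = 0.3e12    # rollouts + reward model + policy updates, lumped together (assumed)

stages = {
    "pretraining": train_flops(N_ACTIVE, PRETRAIN_D),
    "SFT":         train_flops(N_ACTIVE, SFT_D),
    "RLHF/pref":   train_flops(N_ACTIVE, RLHF_D),
}
total = sum(stages.values())
for name, f in stages.items():
    print(f"{name:12s} {f:.2e} FLOPs  ({100 * f / total:.1f}%)")
```

With those placeholder numbers pretraining lands around 95% and the two post-training stages in the low single digits, consistent with the ranges above.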

MoE substantially reduces the FLOPs needed for pretraining vs dense model training for a given capability level, but it also introduces the need for sophisticated routing and load balancing. And if you're a Chinese model like DeepSeek, there might be a big "distill someone else's model" step ;)
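
On the routing and load balancing point, here is a minimal sketch of what a top-k router with a Switch-Transformer-style auxiliary balance term looks like, in plain NumPy; all shapes and constants are illustrative rather than taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_route(tokens, w_gate, k=2):
    """Pick k experts per token and compute a load-balancing auxiliary loss.

    tokens: (n_tokens, d_model), w_gate: (d_model, n_experts).
    Returns chosen expert ids, their gate weights, and the aux loss.
    """
    logits = tokens @ w_gate                      # (n_tokens, n_experts)
    probs = softmax(logits, axis=-1)
    topk = np.argsort(-probs, axis=-1)[:, :k]     # k highest-probability experts
    gates = np.take_along_axis(probs, topk, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)

    # Balance term: fraction of tokens whose top expert is i, times the mean
    # router probability for expert i. Uniform routing minimizes it; a collapsed
    # router (everything to one expert) blows it up.
    n_tokens, n_experts = probs.shape
    load = np.bincount(topk[:, 0], minlength=n_experts) / n_tokens
    importance = probs.mean(axis=0)
    aux_loss = n_experts * float(np.dot(load, importance))
    return topk, gates, aux_loss

rng = np.random.default_rng(0)
toks = rng.standard_normal((1024, 64))
w = rng.standard_normal((64, 8))
experts, gates, aux = topk_route(toks, w)
print(experts.shape, gates.shape, round(aux, 3))   # (1024, 2) (1024, 2), aux ~1.0 if balanced
```

The auxiliary term is what keeps the router from collapsing onto a handful of experts, which is exactly the kind of machinery dense models never need.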

I do think that models are also somewhat evolving, both from a training and inference perspective, to take best advantage of current and planned GPU hardware, except perhaps Gemini which is better tuned for TPU systems. But that evolution means that any piece of highly dedicated hardware for models today, would likely be DOA by the time the silicon made it into working systems, 3-4 years later.
 
At first order these problems seem to be solved; however, they increase the cost of the system significantly, since special materials, techniques, and tools have had to be developed to address them.
The iso-space performance/watt numbers of CS-3-based systems are better than those of B200-based systems. However, the iso-space performance/watt/$ is much worse than that of B200-based systems, and it is evident from the above discussion that the higher cost of solving the problems associated with wafer-scale chips contributes to this.
This is comparing the WSE-3 with the B200, which is of course no longer state-of-the-art:

This means models up to ~40 billion parameters (with 16-bit weights) can fit entirely on-chip, enabling them to be run without ever touching off-chip memory. This is a huge advantage: each core gets single-cycle access to weights and activations.
When models exceed 44 GB, Cerebras uses a technique called Weight Streaming — the model parameters reside in external MemoryX cabinets (which can be many TB), and the wafer streams in the needed weights each layer.
The catch is that when streaming from off-chip, performance depends on that external memory bandwidth and the sparsity/pattern of weight access.

There's no doubt that the Cerebras solution has fantastic performance so long as the models fit in the WSE memory, but it takes a hit when they don't -- and the problem I see here is the almost exponentially increasing size of AI models in cases where low-latency memory access is needed. With other solutions it's a lot easier to get a lot more HBM memory close to the NPUs, and this is then considerably faster than the Cerebras external memory for models which fit in HBM. Once they're too big even for this, the playing field levels out again; in fact Cerebras may have an advantage by having fewer levels of memory hierarchy.

But the other issue with Cerebras is cost -- so not performance but performance per dollar. Here the specialized custom hardware (and much lower volumes) put it at a considerable disadvantage.

If neither of these is correct, why has Cerebras not taken over the entire AI world and wiped out the competition?

Your last question -- nope, absolutely not -- they do a good job, but for sure something more exactly tailored to the task could do it better. But then it would also have less flexibility, the classic custom ASIC problem... ;-)
Your two references are weak.

Neither one of them contains any deep architecture discussion, implementation discussion, or comparisons, and neither has any measurements. The first is written by a "Go To Market" company. ?? I do agree with the concerns about Cerebras packaging, cooling, and power requirements, which is why I have mentioned before that I think their immediate future lies in cloud-based access. There's no significant information in the article about the MemoryX architecture or how it relates to training flows for very large models. I don't see how it supports your hypothesis at all.

The second paper, a Medium post actually, was better, but it still made me chuckle, perhaps more than the first. The author refers to the Google TPU as a "Broadcom ASIC", and groups it with the AWS Trainium? That's ridiculous. The post contains some nicely written high-level discussions of the processor architectures of the several products it covers, but they remain high level, and they neither support nor refute the assertion you're making about Cerebras off-chip MemoryX access with any analysis or data.

While this paper does not examine the performance of a Cerebras multi-node system, it is more of the calibre of paper I look for in product analysis:


Outside of a possible G42 system, I have serious doubts that there are any Cerebras MemoryX installations in customer sites. (I can't find any references to one.) Perhaps when one is installed and measured, we'll see some published results. Until then it is difficult to draw supported conclusions.
 
My personal opinion is that as models evolve and improve, elements of both training and inference are becoming increasingly specialized, requiring tons of flexibility in interconnect, memory configuration/hierarchy and malleability of heavy compute resources. With the current generation of MoE disaggregated transformer models, there isn't just a single stage/process for training or inference - there are numerous stages requiring different mixes of compute/memory/io resources. For instance, training is really a 3-stage process of pretraining → Supervised fine‑tuning (SFT) → Reinforcement Learning with Human Feedback (RLHF) / preference‑based post‑training.

For a frontier MoE LLM, a typical split might look like this in terms of FLOPs (not counting data or human costs):
• Pretraining: ~80–95% of total FLOPs, even with MoE efficiency gains.
• SFT: a few percent (order 1–10% depending on scale and number of passes).
• RLHF / preference tuning (including reward model): also a few percent, sometimes similar to or somewhat larger than SFT, but still far below pretraining.

MoE substantially reduces the FLOPs needed for pretraining vs dense model training for a given capability level, but it also introduces the need for sophisticated routing and load balancing. And if you're a Chinese model like DeepSeek, there might be a big "distill someone else's model" step ;)

I do think that models are also somewhat evolving, both from a training and inference perspective, to take best advantage of current and planned GPU hardware, except perhaps Gemini which is better tuned for TPU systems. But that evolution means that any piece of highly dedicated hardware for models today, would likely be DOA by the time the silicon made it into working systems, 3-4 years later.
Agree that AI HW needs will be a lot more diversified, depending on the different stages mentioned here, and perhaps even on multiple different foundational models (not just all LLM-based). So in a way, whoever can master the process of putting out such customized silicon at a radically faster pace will win.
 
So in a way, whoever can master the process of putting out such customized silicon at a radically faster pace will win.
Two thoughts:
* Agree that we will see new specialized foundational frontier models with very different architectures from the current MoE transformers. But MoE transformers are proving to be far more generalizable than their original text-based forerunners; I'm seeing new multimodal variants (image, video) and co-optimization in the pipeline. My view is that GNNs will eventually enter the fray.
* I'm betting that software-controlled flexible hardware with small-granularity, heavily interconnected accelerators will evolve faster for co-optimization than dedicated hardware.
 
Just out of curiosity, because of the controversy over Cerebras CS-3 system clustering scalability, I began to poke around with some internet searches. One fascinating link that showed up was a programming guide portion of the Cerebras website dedicated to CS-3 components and how to build training applications specifically for the CS-3. Their documentation is surprisingly frank, especially for publicly available information, and deeply technical -- too technical for me in places, for example the specifics of PyTorch programming, and it forced me to educate myself further. For those with insatiable curiosity, or trouble getting to sleep, I recommend this section of the Cerebras website:


Unfortunately, I haven't found any cluster system scaling measurements yet, but embedded in the text you'll see bravado about how near-linear the performance scaling is. Hmmm. I continue to be skeptical about how many clustered systems have been built, because the MemoryX and SwarmX nodes still appear to be based on networking technology that is multiple generations old (100 Gb/s Ethernet). Or, I suppose, it is possible that MemoryX operations are latency-sensitive rather than throughput-constrained, because the weight data is not large.
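
For what it's worth, here's a quick back-of-the-envelope on whether 100 Gb/s Ethernet is obviously a weight-streaming bottleneck. The layer size, link count, and efficiency factor are all my assumptions, not Cerebras numbers:

```python
# Time to stream one transformer layer's weights over N aggregated 100GbE links.
# Layer size, link count and link efficiency are assumptions for illustration only.
LINK_GBPS = 100                      # one 100 Gb/s Ethernet link

def stream_time_s(layer_params_billion, bits=16, links=1, efficiency=0.8):
    layer_bytes = layer_params_billion * 1e9 * bits / 8
    usable_bytes_per_s = links * LINK_GBPS / 8 * 1e9 * efficiency
    return layer_bytes / usable_bytes_per_s

# A hypothetical ~1B-parameter layer (think a 70B-class model with ~70 layers).
for links in (1, 4, 12):
    print(f"{links:2d} links: {stream_time_s(1.0, links=links) * 1e3:.0f} ms per layer")
```

Whether times like those can be hidden behind compute depends entirely on how long the wafer spends on each layer, which is exactly the measurement we don't have.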
 
Just out of curiosity, because of the controversy over Cerebras CS-3 system clustering scalability, I began to poke around with some internet searches. One fascinating link that showed up was a programming guide portion of the Cerebras website dedicated to CS-3 components and how to build training applications specifically for the CS-3. Their documentation is surprisingly frank, especially for publicly available information, and deeply technical -- too technical for me in places, for example the specifics of PyTorch programming, and it forced me to educate myself further. For those with insatiable curiosity, or trouble getting to sleep, I recommend this section of the Cerebras website:


Unfortunately, I haven't found any cluster system scaling measurements yet, but embedded in the text you'll see bravado about how near-linear the performance scaling is. Hmmm. I continue to be skeptical about how many clustered systems have been built, because the MemoryX and SwarmX nodes still appear to be based on networking technology that is multiple generations old (100 Gb/s Ethernet). Or, I suppose, it is possible that MemoryX operations are latency-sensitive rather than throughput-constrained, because the weight data is not large.
I don't think there's any doubt about Cerebras scalability, or the excellent performance when the problem fits "on-wafer", since there's a massive pool of NPUs and SRAM with ridiculous interconnect bandwidth and low latency.

The question is what happens when it *won't* fit on-wafer, and specifically because for a given amount of NPU processing power there's a *lot* less close-cache memory than there is with the HBM stacks in conventional architectures -- I haven't crunched the numbers, but given SRAM-vs-HBM density/die area and HBM stack height I suspect the HBM is bigger by at least 10x, which means at least a 10x bigger model will fit. The HBM latency and bandwidth aren't as good as the Cerebras SRAM's -- and in either case, once you get past the capacity of SRAM/HBM you have to go to off-board memory, which is much slower and higher latency, the same for either system. You can try to hide the latency/bandwidth, but this is only going to work some of the time.
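
Crunching those numbers a little, with the 44GB of WSE-3 SRAM and 192GB of HBM per Blackwell quoted elsewhere in this thread, and treating the compute-equivalence factor (how many GPUs one wafer stands in for) as a pure assumption:

```python
# Ratio of "close" memory per unit of compute: WSE-3 SRAM vs a pool of GPU HBM.
# The compute-equivalence factor (how many GPUs one wafer replaces) is assumed.
WSE_SRAM_GB = 44
HBM_PER_GPU_GB = 192

for gpus_equivalent in (2, 4, 8, 16):        # assumed compute-equivalence per wafer
    hbm_pool = gpus_equivalent * HBM_PER_GPU_GB
    print(f"wafer ~= {gpus_equivalent:2d} GPUs -> HBM pool {hbm_pool:4d} GB, "
          f"{hbm_pool / WSE_SRAM_GB:4.1f}x the on-wafer SRAM")
```

So the "at least 10x" guess holds for any compute-equivalence above roughly two to three GPUs per wafer, and the ratio only grows from there with HBM stack capacity.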

That suggests to me that up to a 1x model size (which will fit into Cerebras on-wafer SRAM) Cerebras will have a considerable performance advantage, which is what the benchmarks show.

From this model size up to maybe 10x (or more? -- see below) a conventional architecture using HBM should be considerably faster than Cerebras which has to go off-board to fetch data.

Above 10x size both architectures should be similar, assuming similar mass memory storage and comms links to it -- what one can do, so can the other.

So the question is -- where do the AI model sizes sit today compared to these 3 regions, and where will they go in future? It seems that the size is rapidly increasing, which you'd think means more and more will not sit in the Cerebras sweet spot any more but will move into the region where HBM wins. And it's *much* easier to add more HBM into a system (if you can get it, of course...) than it is to expand the Cerebras SRAM (which has stopped scaling!), so if the gap in local memory is 10x now (is it? or is it 20x with newer HBM?) it wouldn't be difficult to double this, or even more in future since HBM is still scaling but SRAM isn't.

==> Is there any genuine data out there which answers this question?

On top of that, even if Cerebras wins on performance they have a distinct disadvantage in performance per dollar given the exotic system construction -- so maybe they win in cases where absolute performance matters more than cost (government-funded labs?), but only get a small fraction of the overall AI market which -- let's face it -- is going to be heavily cost-driven (including power consumption).

And once the model is too big to even fit in local HBM any more, performance should be a wash but Cerebras will be at a considerable cost disadvantage -- which means, dead in the water.

Am I missing something?
 
I don't think there's any doubt about Cerebras scalability, or the excellent performance when the problem fits "on-wafer", since there's a massive pool of NPUs and SRAM with ridiculous interconnect bandwidth and low latency.
I think you raise a bunch of valid questions about limits of performance vs model size and cost. Your earlier question was a great one as well: how big is the market for super-fast response time (low latency with high token rates) at high TCO? I think the real question is whether Cerebras can conform their architecture to support the system-level model/system optimizations happening in data-center-level work today. It's a positive that they are operating their own data centers, so there is cost/performance pressure to deliver directly (no middlemen). But we don't have SemiAnalysis-like benchmarks to give us greater insight yet.
 
I don't think there's any doubt about Cerebras scalability, or the excellent performance when the problem fits "on-wafer", since there's a massive pool of NPUs and SRAM with ridiculous interconnect bandwidth and low latency.

The question is what happens when it *won't* fit on-wafer, and specifically because for a given amount of NPU processing power there's a *lot* less close-cache memory than there is with the HBM stacks in conventional architectures -- I haven't crunched the numbers, but given SRAM-vs-HBM density/die area and HBM stack height I suspect the HBM is bigger by at least 10x, which means at least a 10x bigger model will fit. The HBM latency and bandwidth aren't as good as the Cerebras SRAM's -- and in either case, once you get past the capacity of SRAM/HBM you have to go to off-board memory, which is much slower and higher latency, the same for either system. You can try to hide the latency/bandwidth, but this is only going to work some of the time.
To make my position completely clear, since you seem confused about it, the only scalability I'm referring to in my responses is multi-WSE-3 configurations linked through SwarmX and MemoryX. Scalability within a single WSE-3 is a given in my mind.

Comparing Cerebras's memory hierarchy to Nvidia's and AMD's HBM usage seems very complex. For example, as you mentioned, the WSE-3 SRAM size is 44GB. Each Blackwell GPU has 192GB or 288GB of HBM, depending on whether or not it is the two-die Ultra version. And that's just for one Blackwell. HBM, as you probably know, is high bandwidth, but has greater latency to the first byte of an access than DDR5. The WSE-3 SRAM, of course, has a claimed latency of less than 1ns, so probably around 1/100th the latency of HBM. The other difference is that Blackwell keeps several different types of data in HBM, including KV caches, intermediate results, and collectives. Cerebras insists SwarmX and MemoryX are only used for sharing weights between WSE-3s, which not only confuses me, but tells me the processing model for the WSE-3 is completely different from what Nvidia does in its GPUs. I haven't found sufficient architecture details on WSE-3 processing flows in a form I can comprehend, though there are detailed processing flow descriptions in their 1.4.0 SDK, which I haven't studied yet.
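
To illustrate why that first-byte latency gap matters for some accesses and hardly at all for others, here's a simple latency-plus-bandwidth transfer-time sketch; the latency and bandwidth figures are assumptions in the rough ranges discussed above, not measured values:

```python
# Transfer time ~ first-byte latency + size / bandwidth.
# Latency and bandwidth figures are illustrative assumptions, not measured values.
def xfer_us(size_bytes, latency_ns, bandwidth_gb_s):
    return latency_ns / 1e3 + size_bytes / (bandwidth_gb_s * 1e9) * 1e6

CASES = {
    "on-wafer SRAM (assumed ~1 ns, ~10 TB/s)": (1, 10_000),
    "HBM stack (assumed ~100 ns, ~4 TB/s)":    (100, 4_000),
}
for size in (256, 64 * 1024, 16 * 1024 * 1024):   # small read, a tile, a big tensor slab
    for name, (lat_ns, bw) in CASES.items():
        print(f"{size:>10,} B  {name}: {xfer_us(size, lat_ns, bw):8.3f} us")
```

Small, scattered accesses are dominated by latency, where the SRAM's ~100x advantage shows up directly; big streaming transfers are dominated by bandwidth, where the comparison is far less lopsided.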


That suggests to me that up to a 1x model size (which will fit into Cerebras on-wafer SRAM) Cerebras will have a considerable performance advantage, which is what the benchmarks show.

From this model size up to maybe 10x (or more? -- see below) a conventional architecture using HBM should be considerably faster than Cerebras which has to go off-board to fetch data.

Above 10x size both architectures should be similar, assuming similar mass memory storage and comms links to it -- what one can do, so can the other.

So the question is -- where do the AI model sizes sit today compared to these 3 regions, and where will they go in future? It seems that the size is rapidly increasing, which you'd think means more and more will not sit in the Cerebras sweet spot any more but will move into the region where HBM wins. And it's *much* easier to add more HBM into a system (if you can get it, of course...) than it is to expand the Cerebras SRAM (which has stopped scaling!), so if the gap in local memory is 10x now (is it? or is it 20x with newer HBM?) it wouldn't be difficult to double this, or even more in future since HBM is still scaling but SRAM isn't.
Given the fundamental differences in the use models for the memory hierarchies of the WSE-3 and Blackwell, and the highly contrasting technologies, I can't answer this question yet. There are only two possibilities: either you have a fundamental misunderstanding about how the WSE-3 works in a multi-node configuration, or Cerebras is lying about inter-WSE-3 scaling efficiency. I don't see a middle ground.
==> Is there any genuine data out there which answers this question?
Not that I've found yet. I'm thinking about just asking Cerebras directly.
On top of that, even if Cerebras wins on performance they have a distinct disadvantage in performance per dollar given the exotic system construction -- so maybe they win in cases where absolute performance matters more than cost (government-funded labs?), but only get a small fraction of the overall AI market which -- let's face it -- is going to be heavily cost-driven (including power consumption).

And once the model is too big to even fit in local HBM any more, performance should be a wash but Cerebras will be at a considerable cost disadvantage -- which means, dead in the water.

Am I missing something?
I think Cerebras does have a disadvantage in requiring exotic system construction, housing, and maintenance. (I like the term "exotic".) We also don't know what their systems really cost. Like, how much are Sandia Labs and G42 really paying for their WSE-3 hardware? Since Cerebras is private, we also don't know how much money they're losing.
 