
Cisco launched its Silicon One G300 AI networking chip in a move that aims to compete with Nvidia and Broadcom.

Can you point to one of these analyses?

I think you've already decided the answer you will believe, so it looks like asking this question here is not going to be productive. Do you believe GPUs, even with tensor cores and specialized interconnects like NVLink, are an optimal answer for AI training?
To first order these problems seem to be solved; however, solving them significantly increases the cost of the system, since special materials, techniques and tools have had to be developed to address them.
The iso-space (same-footprint) performance/watt numbers of CS-3-based systems are better than those of B200-based systems. However, the iso-space performance/watt/$ is much worse than that of B200-based systems, and it is evident from the above discussion that the higher cost of solving the problems associated with wafer-scale chips contributes to this.
This compares the WSE-3 with the B200, which is of course no longer the state of the art:

This means models up to ~40 billion parameters (with 16-bit weights) can fit entirely on-chip, enabling them to be run without ever touching off-chip memory. This is a huge advantage: each core gets single-cycle access to weights and activations.
When models exceed 44 GB, Cerebras uses a technique called Weight Streaming: the model parameters reside in external MemoryX cabinets (which can hold many TB), and the wafer streams in the needed weights layer by layer.
The catch is that when streaming from off-chip, performance depends on that external memory bandwidth and the sparsity/pattern of weight access.
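To make the on-chip vs. weight-streaming threshold concrete, here's a rough back-of-the-envelope sketch in plain Python. The 44 GB on-chip SRAM figure is taken from the quote above; the model sizes and byte-widths in the loop are just illustrative assumptions, not claims about any specific deployment.

```python
# Back-of-the-envelope check: does a model's weight footprint fit in the
# WSE-3's ~44 GB of on-chip SRAM, or does it need Weight Streaming from
# external MemoryX? Model sizes below are illustrative, not product claims.

ON_CHIP_SRAM_GB = 44  # on-chip SRAM budget quoted above

def weight_footprint_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Approximate weight storage only; ignores activations and overheads."""
    return params_billion * 1e9 * bytes_per_weight / 1e9

for params_b, bytes_pw in [(8, 2), (20, 2), (70, 2), (70, 1)]:
    gb = weight_footprint_gb(params_b, bytes_pw)
    verdict = "fits on-chip" if gb <= ON_CHIP_SRAM_GB else "needs Weight Streaming"
    print(f"{params_b:>3}B params @ {bytes_pw} B/weight -> {gb:5.0f} GB: {verdict}")
```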

There's no doubt that the Cerebras solution has fantastic performance so long as the models fit in the WSE memory, but it takes a hit when they don't -- and the problem I see here is the almost exponentially increasing size of AI models in cases where low-latency memory access is needed. With other solutions it's a lot easier to get a lot more HBM memory close to the NPUs, and this is then considerably faster than the Cerebras external memory for models which fit in HBM. Once they're too big even for this, the playing field levels out again; in fact Cerebras may have an advantage by having fewer levels of memory hierarchy.
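The bandwidth point can be made concrete with a simple bound: for single-stream decoding, essentially every weight has to be read once per generated token, so token rate is capped at roughly memory bandwidth divided by weight footprint. The sketch below uses purely hypothetical bandwidth tiers standing in for on-chip SRAM, HBM, and external streaming; none of the numbers are measured figures for any product.

```python
# Rough upper bound on single-stream decode speed when weight reads dominate:
# tokens/s <= memory_bandwidth / bytes_of_weights_read_per_token.
# All bandwidth figures are hypothetical placeholders, not product specs.

def max_tokens_per_s(params_billion: float, bytes_per_weight: float,
                     bandwidth_tb_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_weight
    return bandwidth_tb_s * 1e12 / weight_bytes

PARAMS_B, BYTES_PW = 70, 2  # 70B parameters at 2 bytes/weight -> 140 GB of weights

for tier, bw in [("on-chip SRAM", 100.0), ("HBM", 8.0), ("external streaming", 1.0)]:
    rate = max_tokens_per_s(PARAMS_B, BYTES_PW, bw)
    print(f"{tier:>18} @ {bw:5.1f} TB/s -> <= {rate:6.1f} tokens/s")
```

The ordering of the tiers, rather than the specific numbers, is the point: whichever tier has to hold the weights sets the ceiling.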

But the other issue with Cerebras is cost -- so not performance but performance per dollar. Here the specialized custom hardware (and much lower volumes) puts it at a considerable disadvantage.

If neither of these is correct, why has Cerebras not taken over the entire AI world and wiped out the competition?

Your last question -- nope, absolutely not -- they do a good job, but for sure something more exactly tailored to the task could do it better. But then it would also have less flexibility, the classic custom ASIC problem... ;-)
 
Do you believe GPUs, even with tensor cores and specialized interconnects like NVLink, are an optimal answer for AI training?
My personal opinion is that as models evolve and improve, elements of both training and inference are becoming increasingly specialized, requiring tons of flexibility in interconnect, memory configuration/hierarchy and malleability of heavy compute resources. With the current generation of MoE disaggregated transformer models, there isn't just a single stage/process for training or inference - there are numerous stages requiring different mixes of compute/memory/io resources. For instance, training is really a 3-stage process of pretraining → Supervised fine‑tuning (SFT) → Reinforcement Learning with Human Feedback (RLHF) / preference‑based post‑training.

For a frontier MoE LLM, a typical split might look like this in terms of FLOPs (not counting data or human costs; a rough numeric sketch follows the list):
• Pretraining: ~80–95% of total FLOPs, even with MoE efficiency gains.
• SFT: a few percent (order 1–10% depending on scale and number of passes).
• RLHF / preference tuning (including reward model): also a few percent, sometimes similar to or somewhat larger than SFT, but still far below pretraining.
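
As a rough sanity check on those percentages, here's a sketch using the common ≈6·N·D FLOPs approximation for training (N = active parameters, D = tokens). The parameter count and token counts are made-up round numbers, not any particular model's figures, and RLHF rollouts plus reward-model training are lumped into one equivalent token count.

```python
# Rough FLOPs split across training stages, using the common ~6 * N_active * D
# approximation (forward + backward). All figures below are made-up round
# numbers for illustration only.

def train_flops(active_params: float, tokens: float) -> float:
    return 6 * active_params * tokens

ACTIVE_PARAMS = 40e9  # activated parameters per token for a hypothetical MoE

stages = {
    "pretraining": train_flops(ACTIVE_PARAMS, 10e12),   # ~10T tokens
    "SFT":         train_flops(ACTIVE_PARAMS, 0.4e12),  # fine-tuning passes
    "RLHF/pref":   train_flops(ACTIVE_PARAMS, 0.6e12),  # rollouts + reward model, lumped
}

total = sum(stages.values())
for name, flops in stages.items():
    print(f"{name:>12}: {flops:.2e} FLOPs ({100 * flops / total:4.1f}%)")
```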

MoE substantially reduces the FLOPs needed for pretraining vs dense model training for a given capability level, but it also introduces the need for sophisticated router load balancing. And if you're a Chinese model like DeepSeek, there might be a big "distilling someone else's model" step ;)
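
For anyone unfamiliar with what that balancing involves, here is a toy numpy sketch of top-k expert routing with a Switch-Transformer-style auxiliary load-balancing loss. It's purely illustrative; real routers add capacity limits, jitter, z-losses, and so on, and the shapes and constants here are arbitrary.

```python
import numpy as np

# Toy top-k MoE router with a Switch-style auxiliary load-balancing loss.
# Shapes and constants are arbitrary; real routers are considerably fancier.

rng = np.random.default_rng(0)
tokens, d_model, n_experts, top_k = 16, 32, 8, 2

x = rng.standard_normal((tokens, d_model))           # token representations
w_router = rng.standard_normal((d_model, n_experts)) * 0.02

logits = x @ w_router                                 # (tokens, experts)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)             # softmax gate probabilities
top_experts = np.argsort(-probs, axis=1)[:, :top_k]   # chosen experts per token

# Auxiliary loss: push the fraction of routing assignments each expert receives
# toward the mean gate probability mass it attracts (both ~1/n_experts if balanced).
dispatch = np.zeros((tokens, n_experts))
np.put_along_axis(dispatch, top_experts, 1.0, axis=1)
frac_assigned = dispatch.sum(axis=0) / (tokens * top_k)  # f_i
mean_probs = probs.mean(axis=0)                          # P_i
aux_loss = n_experts * float(np.sum(frac_assigned * mean_probs))

print("assignments per expert:", dispatch.sum(axis=0).astype(int))
print("aux load-balancing loss:", round(aux_loss, 3))
```

Perfectly balanced routing drives this loss toward 1.0; experts that attract both more assignments and more probability mass push it higher, which is what the training objective penalizes.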

I do think that models are also somewhat evolving, both from a training and an inference perspective, to take best advantage of current and planned GPU hardware, except perhaps Gemini, which is better tuned for TPU systems. But that evolution means that any piece of highly dedicated hardware for today's models would likely be DOA by the time the silicon made it into working systems, 3-4 years later.
 
Your two references are weak.

Neither one of them contains any deep architecture discussion, implementation details, or comparisons, and neither has any measurements. The first is written by a "Go To Market" company? I do agree with the concerns about Cerebras packaging, cooling, and power requirements, which is why I have mentioned before that I think their immediate future lies in cloud-based access. There's no significant information in the article about the MemoryX architecture or how it relates to training flows for very large models. I don't see how it supports your hypothesis at all.

The second paper, actually a Medium post, was better, but it still made me chuckle, perhaps more than the first. The author refers to the Google TPU as a "Broadcom ASIC" and groups it with the AWS Trainium? That's ridiculous. The post contains some nicely written high-level discussions of the processor architecture of the several products it covers, but the discussions stay at that high level and neither support nor refute, with analysis or data, the assertion you're making about Cerebras off-chip MemoryX access.

While this paper does not examine the performance of a Cerebras multi-node system, it is more of the calibre of paper I look for in product analysis:


Outside of a possible G42 system, I have serious doubts that there are any Cerebras MemoryX installations at customer sites. (I can't find any references to one.) Perhaps when one is installed and measured, we'll see some published results. Until then it is difficult to draw supported conclusions.
 