Cerebras to raise IPO price range to $150-$160 as demand surges, sources say

Since Cerebras supports multi-user workloads on common hardware, doesn't this capability answer your question?

The company is considering a new IPO price range of $150-$160 a share, up from $115-$125 a share, and raising the number of shares marketed to 30 million from 28 million, said the sources, who asked not to be identified because the information isn't public yet.
At the top of the new range, Cerebras would raise roughly $4.8 billion, up from $3.5 billion under its original terms, though the figures remain subject to change before pricing, the people said.
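For anyone who wants to sanity-check the reported totals: gross proceeds at the top of a range are just shares marketed times the top price. A minimal sketch using only the share counts and prices in the excerpt above (the function name is my own, and this ignores fees and any underwriter overallotment):

```python
def gross_proceeds(shares: int, price_per_share: int) -> int:
    """Gross raise in dollars at a given price, before fees or overallotment."""
    return shares * price_per_share


# Top of the original range: 28 million shares at $125.
original_top = gross_proceeds(28_000_000, 125)
# Top of the reported new range: 30 million shares at $160.
revised_top = gross_proceeds(30_000_000, 160)

print(f"original: ${original_top / 1e9:.1f}B, revised: ${revised_top / 1e9:.1f}B")
```

Both results line up with the roughly $3.5 billion and $4.8 billion figures attributed to the sources.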
 
Three updates on this one:

1) This article is much better than the S1 for us hardware types.


2) The article delves into the economics and potential Pareto curve and limitations of Cerebras - I'm not going to summarize them all here. Most important is that Cerebras is good at fast tokens and they are 6x more valuable (at least at current rates from OpenAI) vs normal tokens. No actual Pareto frontiers for Cerebras yet, though.


3) Pricing is now $185 / share
I read the section on the WSE-3 I/O networking earlier. I'm still going over it to make sure I understand what the authors are saying, but I'm not convinced the article accurately represents how the Cerebras system uses I/O. Yet.
 
I've often wondered... how much would Cerebras be worth if they achieved the same results without wafer-scale? How big is the wafer-scale premium? Can wafer-scale really be applied more broadly than AI? (I'm currently a skeptic.)
 
how much would Cerebras be worth if they achieved the same results without wafer-scale?
Not sure they could have. What would they have done - another Groq?
How big is the wafer-scale premium?
That's a good question - maybe we'll get a better view on their Pareto cost / interactivity tradeoffs as they grow. And their collaboration with Amazon on disaggregation should be revealing in how well they can build heterogeneous systems that offer a broader set of model operating cost / interactivity points.
Can wafer-scale really be applied more broadly than AI?
Who knows? They made a good bet on AI as the killer app back in 2016 - an app that benefits from huge scale / density / interconnect, plus plenty of high bandwidth SRAM, with sufficient value that the added cost (yield, HW and SW R&D, cooling) is secondary. Maybe there will be another app like AI some day. But until then, there's still a lot of running room with AI in the data center.
 
An interesting tidbit that I noticed when I was looking at the Cerebras website... one of the original individual investors in Cerebras was Lip-Bu Tan. It's a very distinguished list, including names like Andy Bechtolsheim, Pradeep Sindhu, Dadi Perlmutter, Fred Webber, and Nick McKeown, among others.
 
As I type this Cerebras (CBRS) is trading at $301/share.

Even higher after hours.

2025 revenue $510M

It's interesting, if nothing else.

The cost of AI isn't coming down anytime soon, methinks, if these guys are a big driver. I assume they want to pump that revenue; otherwise their valuation is nonsense.
 
I read the section on the WSE-3 I/O networking earlier. I'm still going over it to make sure I understand what the authors are saying, but I'm not convinced the article accurately represents how the Cerebras system uses I/O. Yet.
Based on the evidence SemiAnalysis offers, I'm not sure that MemoryX and SwarmX I/O work as well for streaming weights (inference & training) and gradients (training) as positioned. AFAIK, their wins today leverage sharding a smallish model within just a few WSEs.

"The key takeaway is that Cerebras, while fast, pays a large latency cost to move data on and off the wafer, and therefore their cost-to-performance ratio (or perf per Joule) will depend on how much of that latency they can hide or minimize. A clue about the difficulty of this in practice may be reflected in Model offerings on Cerebras Inference Cloud. The largest production model is GPT-OSS, which is only 120B total parameters. There are larger preview models, but even those top out at 355B (GLM 4.7). For reference, Sonnet and Opus are 1T and 5T parameters respectively, per Elon. Notably, the formerly popular Llama 70B and 405B models were also deprecated, potentially due to the economics of serving them."

"Cerebras’s chips are only economically capable of serving relatively small models today, or at least based on what’s available to the public. GPT-5.3-Codex-Spark, for example, is NOT at all the same thing as the full GPT-5.3-Codex; it’s gpt-oss-120b fine-tuned on GPT-5.3-codex traces. In other words, it’s a distilled model that’s over 10x smaller."

Found the video version of this article to be helpful as well:

 
I'm not buying this explanation yet. MemoryX units are only used to store weights. The parameters are declustered into each WSE-3 node, then the processing is all local, so there isn't inter-node traffic for processing like there is for Nvidia systems. It looks like the MemoryX Ethernet links are only for filling and draining the 44GB SRAM, which at ~10GB/s per link wouldn't take very long. But I'm not confident that I fully understand model processing yet, because I've never read the code being executed. I've gotten lazy and distracted in my old age. I'm also not sure about how mature Cerebras multi-node systems are.
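A quick back-of-the-envelope check on the fill time, as a sketch only. The 44GB and ~10GB/s figures come from the post above; the 1.2Tbps aggregate is the WSE-3 I/O figure SemiAnalysis quotes elsewhere in this thread. I'm assuming the links can be driven at full rate with no protocol overhead, which a real system would not achieve.

```python
def fill_time_seconds(capacity_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to fill (or drain) a memory of the given size at a sustained rate."""
    return capacity_gb / bandwidth_gb_per_s


SRAM_GB = 44.0                 # on-wafer SRAM, per the post above
PER_LINK_GB_PER_S = 10.0       # approximate per-link rate, per the post above
AGGREGATE_BITS_PER_S = 1.2e12  # aggregate I/O figure quoted from SemiAnalysis

# Convert the aggregate figure from bits/s to GB/s (8 bits per byte).
aggregate_gb_per_s = AGGREGATE_BITS_PER_S / 8 / 1e9

one_link_s = fill_time_seconds(SRAM_GB, PER_LINK_GB_PER_S)
all_io_s = fill_time_seconds(SRAM_GB, aggregate_gb_per_s)

print(f"one link: {one_link_s:.1f} s, full aggregate I/O: {all_io_s:.2f} s")
```

So a one-time fill is a matter of seconds at most. The open question in the thread is different: whether weights can be re-streamed at that rate on every layer of every step, which is a sustained-bandwidth problem rather than a load-time one.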
 
What does their roadmap look like?

Their advantage is wafer-scale, and they have taken up the whole wafer already. So, if they want to grow faster than process shrink, they will have to go off wafer and their current advantage/differentiation will be gone.

In fact, I would submit that once they have to go "multi-wafer", they will be at a disadvantage, since their I/O coastline per wafer is a lot smaller than a regular design's.
 
In fact, I would submit that once they have to go "multi-wafer", they will be at a disadvantage, since their I/O coastline per wafer is a lot smaller than a regular design's.
Cerebras claims that once the weights are stored in the SRAM, all processing is local to a node. So there is supposedly no multi-wafer processing. I'm still studying their claims, but I haven't found an obvious hole yet.


 
I'm just going on what SemiAnalysis is saying, but it looks like they have to go multi-WSE for mid-sized models.

"This bandwidth constraint is what makes it difficult for Cerebras to serve larger parameter models. Any large tensors to be used must be resident on the wafer; streaming on/off the wafer is impossible with such a small amount of IO. Similarly, any sharding strategy that requires high-bandwidth collectives at each layer is categorically ruled out.

The only real option is pipeline parallelism, which slices the model layer-wise across wafers and only transfers activations between stages, relying on the fact that activations are small relative to weights. This reduces network requirements and keeps the capacity-demanding components (the weights, and to some extent the KV cache) stationary instead of moving on or off the wafer. For instance, Cerebras shards Llama3 70B across 4x WSE-3, transferring only the activations between each wafer and staying well within the available 1.2Tbps IO."
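Rough numbers behind the pipeline-parallel claim in that quote, as a sketch under loud assumptions: 16-bit weights and activations, a hidden size of 8192 (the public Llama 3 70B config), and all 44GB of WSE-3 SRAM (the figure used elsewhere in this thread) available for weights. That last assumption is optimistic, since it ignores KV cache and working memory. The function names are mine.

```python
import math


def wafers_needed(params: int, bytes_per_param: int, sram_bytes: int) -> int:
    """Minimum wafers whose combined SRAM can hold the weights alone."""
    return math.ceil(params * bytes_per_param / sram_bytes)


def activation_bytes_per_token(hidden_size: int, bytes_per_value: int) -> int:
    """Bytes handed from one pipeline stage to the next for a single token."""
    return hidden_size * bytes_per_value


PARAMS = 70_000_000_000       # Llama 3 70B parameter count (rounded)
SRAM_BYTES = 44_000_000_000   # assumed usable SRAM per wafer
IO_BYTES_PER_S = 1.2e12 / 8   # 1.2 Tbps from the quote, converted to bytes/s

stages = wafers_needed(PARAMS, 2, SRAM_BYTES)
handoff = activation_bytes_per_token(8192, 2)
# Upper bound on tokens/s across one stage boundary if I/O were the only limit.
io_bound_tokens_per_s = IO_BYTES_PER_S / handoff

print(stages, handoff, round(io_bound_tokens_per_s))
```

Under those assumptions the weights alone need 4 wafers, which happens to match the 4x WSE-3 figure in the quote, and each token handoff is only 16 KiB, so activation traffic sits far below the I/O limit. None of this says anything about latency per hop, which is the other half of the SemiAnalysis argument.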
 
I'm just going on what SemiAnalysis is saying, but it looks like they have to go multi-WSE for mid-sized models.
I don't think SemiAnalysis knows what they're talking about. They clearly haven't read Cerebras's web pages with technical papers on what they do.
"This bandwidth constraint is what makes it difficult for Cerebras to serve larger parameter models. Any large tensors to be used must be resident on the wafer; streaming on/off the wafer is impossible with such a small amount of IO. Similarly, any sharding strategy that requires high-bandwidth collectives at each layer is categorically ruled out.
I think this is incorrect, unless Cerebras is just plain lying, which I doubt.
The only real option is pipeline parallelism, which slices the model layer-wise across wafers and only transfers activations between stages, relying on the fact that activations are small relative to weights. This reduces network requirements and keeps the capacity-demanding components (the weights, and to some extent the KV cache) stationary instead of moving on or off the wafer. For instance, Cerebras shards Llama3 70B across 4x WSE-3, transferring only the activations between each wafer and staying well within the available 1.2Tbps IO."
Sharding is exactly what Cerebras claims to do, and they claim not to use a shared KV cache.

One of these days I'll stop being so difficult and read the code. But this procrastination behavior is typical for me. Frankly, I don't find anything to do with LLMs technically interesting. In my professional past there were multiple areas which I just wasn't fascinated by, and it took a job assignment - which I usually resisted - before I learned some technologies that I thought were not very interesting. Block storage, NFS file systems, Ethernet/IP networking, and digital chip design come to mind. Chip design won me over, the others... still nauseating. ;) And now that I'm retired I don't have anyone threatening me, or offering raises and promotions.
 
I don't think SemiAnalysis knows what they're talking about. They clearly haven't read Cerebras's web pages with technical papers on what they do.
I think the benefit they have is hands-on time with the actual hardware. But they might be seeing current software / model limitations as well.
 
The bandwidth advantage and disadvantage are clear to see, and I don't think anyone would dispute that **IF** the application cannot fit or be partitioned to fit nicely within one WSE, then the penalty is going to be extra steep.

Now, I am convinced there is a bandwidth constraint if the WSE border is breached; what I would very much like to be enlightened on is how "shard-able" or partitionable inference algorithms are generally. Since inferencing is a huge market, if inferencing can nicely fit into WSEs with little inter-WSE traffic, then their TAM may well justify their cap, and vice versa.
 
The bandwidth advantage and disadvantage are clear to see, and I don't think anyone would dispute that **IF** the application cannot fit or be partitioned to fit nicely within one WSE, then the penalty is going to be extra steep.
Your knowledge of parallel processing strategies and algorithms is incomplete. Decades ago a similar strategy for parallel processing with scale-out databases was called "shared nothing architecture". For training very large models, Cerebras says they can partition the training samples across multiple CS3 nodes, and the processing on each partition proceeds in isolation for each layer of the model:

During the forward propagation phase, as illustrated in Fig. 5:
  • Layer 1 is the first to be loaded onto the WSE. Input pre-processing servers handle the processing and streaming of training data to the WSE, while MemoryX streams the weights of layer 1 into the WSE. If there’s a request for multiple CS-X in the training job, SwarmX broadcasts the weights from MemoryX to all the WSEs. The batch of training samples is divided into equally sized subsets, or shards, with each shard directed to a respective CS-X. This technique is known as data parallelism.
  • Each WSE simultaneously performs the forward computation for layer 1 in parallel with the others.
  • The calculated activations for layer 1 are retained in the WSE’s memory.
  • Subsequently, MemoryX broadcasts the weights of layer 2 to the WSEs.
  • Each WSE conducts the forward computations for layer 2, utilizing its stored activations from layer 1.
    This same process is repeated for layer 3.
  • In this manner, the forward computation for each layer is carried out by employing the previously computed activations from the preceding layer. Additionally, the calculated activations for the current layer are stored in the WSE’s memory for use by the next loaded layer.
  • At the loss layer, the actual labels from the training data are employed to compute the network loss delta, which represents the gradient of the scalar loss concerning the activation of the output layer (layer 3). This loss delta plays a crucial role in calculating layer-by-layer deltas and weight gradients during the subsequent backward pass.

    1779197754293.png
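To make the forward-pass steps above concrete, here is a toy sketch in plain Python. It is not Cerebras code: the layer function, the names, and the tiny matrices are all made up for illustration. It only shows the data flow described in the list, where one layer's weights at a time are broadcast to every replica, and each replica keeps its own shard's activations resident between layers.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense_relu(weights: Matrix, x: Vector) -> Vector:
    """One toy layer: y = relu(W @ x)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def shard(batch: List[Vector], num_replicas: int) -> List[List[Vector]]:
    """Split a batch into equally sized shards, one per replica."""
    if len(batch) % num_replicas:
        raise ValueError("batch size must divide evenly across replicas")
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]


def weight_streaming_forward(
    layers: List[Matrix], batch: List[Vector], num_replicas: int
) -> List[Vector]:
    # Each replica holds only the activations for its own shard.
    activations = shard(batch, num_replicas)
    for weights in layers:  # the weight store streams one layer at a time
        # "Broadcast": every replica gets the same layer weights and applies
        # them to its own shard. No replica reads another replica's data.
        activations = [
            [dense_relu(weights, x) for x in replica] for replica in activations
        ]
    # Gathered here only so the result can be inspected.
    return [y for replica in activations for y in replica]


layers = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]]]
batch = [[1.0, 2.0], [2.0, 1.0], [0.0, 3.0], [4.0, 0.0]]

single = weight_streaming_forward(layers, batch, num_replicas=1)
sharded = weight_streaming_forward(layers, batch, num_replicas=2)
print(single == sharded)
```

The forward pass gives the same answer with one replica or two, which is the "shared nothing" point being made. The sketch leaves out the backward pass, where weight gradients from all replicas have to be combined before the weights are updated, so that is where cross-node traffic would come back in.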
Now, I am convinced there is a bandwidth constraint if the WSE border is breached; what I would very much like to be enlightened on is how "shard-able" or partitionable inference algorithms are generally. Since inferencing is a huge market, if inferencing can nicely fit into WSEs with little inter-WSE traffic, then their TAM may well justify their cap, and vice versa.
For inferencing, the use of clustering is apparently to increase throughput for a single model through replication, or to provide an inferencing service for multiple models.


 