
Interesting Samsung 14nm vs. TSMC 4nm performance comparison from Groq at SFF 2024

hskuo

Well-known member
Just wondering if my reading is correct or not. Does this say Groq's AI inference chip demonstrates that Samsung 14LPP is 6x faster, 4x cheaper, and uses 1/3 less energy than TSMC 4nm?
Is that an apples-to-apples comparison? If so, then Samsung 14LPP is so good there's no need to move to the next node.
 
AFAIK, they decided to use Samsung's Taylor fab. Since they're designing a DRAM-less (SRAM-only) inference chip, it's no wonder it's faster than their competitors. So basically they're emphasizing their design superiority, not comparing foundries.
 
It's TSMC N4 not 4N. 😅

From what I was told, Groq pivoted and no longer sells chips; they are a cloud company now (GroqCloud), so that is a big fat $300M+ fail.
 
It's TSMC N4 not 4N. 😅

From what I was told, Groq pivoted and no longer sells chips; they are a cloud company now (GroqCloud), so that is a big fat $300M+ fail.

The current version of Groq's AI processors is manufactured by GlobalFoundries on its 14nm node. Groq's next-generation AI processors will be made at Samsung's new Taylor, Texas fab on Samsung's SF4X (4nm) process. If Groq on GlobalFoundries' 14nm is already so superior, can we imagine how far Groq will be ahead of all its competitors once it starts using Samsung 4nm!

Am I missing something here?
 
AFAIK, they decided to use Samsung's Taylor fab. Since they're designing a DRAM-less (SRAM-only) inference chip, it's no wonder it's faster than their competitors. So basically they're emphasizing their design superiority, not comparing foundries.

It’s very misleading to treat this as a process comparison when the big difference is architecture. Groq went with an SRAM-only, batch-size-1 approach that only allows a single user to run inference on the loaded model at a time, so it’s very responsive and performant, and cheap on a per-performance basis, but very expensive on a per-user basis. I think that’s why they had to pivot into the CSP (cloud service provider) business - none of the existing platform suppliers or CSPs could figure out how to build and sell batch-size-1 systems or cloud services. The example I heard was $11M in hardware to run ChatGPT (largest-parameter version) for a single user at a time. True, it might give that one user / user context blindingly fast results, with no latency. They might be able to figure out how to build and operate a cloud that serves many users cost-effectively, but it’s going to need radically different provisioning and management infrastructure. And time and experience.
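The per-user economics here can be put into a trivial sketch. The $11M batch-size-1 figure is the hearsay number from this post; the $300k / 100-concurrent-user figure for a conventional DRAM+GPU server is my own illustrative assumption, not vendor data:

```python
# Amortize hardware cost over the number of users it can serve concurrently.
# All dollar figures are illustrative assumptions, not vendor pricing.

def cost_per_concurrent_user(hardware_cost_usd: float, concurrent_users: int) -> float:
    """Hardware cost divided across simultaneously served users."""
    return hardware_cost_usd / concurrent_users

# Batch-size-1 system quoted in the thread: ~$11M serves one user at a time.
batch1 = cost_per_concurrent_user(11_000_000, 1)

# Hypothetical DRAM+GPU server at $300k serving ~100 concurrent users.
batched = cost_per_concurrent_user(300_000, 100)

print(f"batch-1: ${batch1:,.0f} per concurrent user")
print(f"batched: ${batched:,.0f} per concurrent user")
```

Cheap per unit of throughput and ruinous per user can both be true at once, which is the whole argument above.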
 
It’s very misleading to treat this as a process comparison when the big difference is architecture. Groq went with an SRAM-only, batch-size-1 approach that only allows a single user to run inference on the loaded model at a time, so it’s very responsive and performant, and cheap on a per-performance basis, but very expensive on a per-user basis. I think that’s why they had to pivot into the CSP (cloud service provider) business - none of the existing platform suppliers or CSPs could figure out how to build and sell batch-size-1 systems or cloud services. The example I heard was $11M in hardware to run ChatGPT (largest-parameter version) for a single user at a time. True, it might give that one user / user context blindingly fast results, with no latency. They might be able to figure out how to build and operate a cloud that serves many users cost-effectively, but it’s going to need radically different provisioning and management infrastructure. And time and experience.

I remember some material saying that Groq's benchmark has 3 concurrent users (a full rack of servers). In many cases, a DRAM + GPU combination supports more than 100 concurrent users (a single 4U server). Groq might have some opportunities if there are applications that want blazing-fast language model latency, or a foundry with tons of idle capacity willing to discount A LOT.
 
Groq might have some opportunities if there are applications that want blazing-fast language model latency, or a foundry with tons of idle capacity willing to discount A LOT.
Yes, there are definitely applications where single-user real-time inference is crucial, like ADAS, but as you suggest, the price point might be tricky today.

It’s also interesting to see how Groq, Cerebras and even NVIDIA are trying to completely reshape data center servers to offer what might be far better value propositions for Gen AI apps than existing servers/data centers. At the same time, the hyperscalers/CSPs are building chips that accelerate Gen AI, at least for their current internal workloads, within their existing infrastructure. And Intel and AMD are trying to straddle current CSP external workloads and new Gen AI workloads with their offerings (Xeon plus Gaudi, Epyc plus MI300), though it’s not clear that they have the “new” hybrid server architecture figured out.
 
Just to clear some stuff up here...

Groq's LPU v1 was a compiler-first design and is a deterministic processor. The idea is that you know where your data and instructions are on any given clock cycle because it's all traceable at compile time. When they started, they went after the batch-1 market because their SLAs were amazingly good - every batch-1 inference took the same amount of time, regardless. In the days before transformers, when the market was looking like image recognition models / CNN derivatives, it looked amazing.

They then pivoted hard to LLMs. With no DRAM or HBM, they rely on 230 MB of SRAM per chip. This means you need 10 racks, or 570-odd chips, to run some good 70B models at FP16 (correct me if I'm wrong; I don't think the chip does INT8). Transformers/LLMs don't really need batch 1, but they went ahead anyway. They've done some amazing marketing, and speaking to Jon, they're standing up three customer clouds this month, the biggest being 1.7 million tokens/sec (that's combined across all users). They went hard on 300 tokens/sec last year on LPU v1 with Llama2-7B, and now they're up to 1300 tokens/sec, showing that the software still has headroom.
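A quick back-of-envelope check of that chip count, counting model weights only (KV cache, activations, and any weight duplication across chips are ignored, which is why it lands a bit above the quoted figure):

```python
# 70B parameters at FP16 (2 bytes each) against 230 MB of SRAM per LPU.
params = 70e9
bytes_per_param = 2          # FP16
sram_per_chip = 230e6        # 230 MB of on-die SRAM

model_bytes = params * bytes_per_param        # 140 GB of weights
chips_needed = model_bytes / sram_per_chip

print(f"{chips_needed:.0f} chips")
```

That works out to roughly 609 chips, in the same ballpark as the "570-odd" figure above.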

Don't forget, comparatively speaking, GF 14nm is cheap here, which is why a lot of AI startups have been using it for 1st/2nd-gen chips. I suspect if Samsung wants to leverage Groq as an AI/HPC peak customer, there might be some good deal on SF4X to help with co-marketing. They only just sent off the final RTL (GDSII?) to the fab in the past few weeks, Jon said on stage. That means it's being built in Korea - they'll get silicon back in a few months and then to customers next year. It may pivot to Taylor over time, but Taylor's not ready (afaik).

I said most of this on Twitter already, but here are some more numbers.

Groq expects to ship/stand up:
- 2024 - 0.1M chips this year
- 2025 - 1.2M chips next year
Those are all LPU v1 numbers - nothing about v2.

Given Jon's history as part of the TPU crowd, and that the Groq chip 'is a good chip but it's a shame transformers came', I suspect v2 will be more aligned with industry requirements. It will be interesting to see if they've kept the deterministic nature of the processor, though.

I'll be posting my write-up from the Samsung event on Monday; you'll find it on my substack.
 
@IanCutress, great analysis. Thanks for the insights and for the forthcoming Samsung write-up. Didn’t realize that v1 was aimed at a limited range of inference. Will be interesting to see v2 with transformers. But current transformer structure is probably not the end of neural network evolution. We might even see networks that restructure during training into forms we’re not even thinking of today, kind of like a young human brain.
 
@IanCutress, great analysis. Thanks for the insights and for the forthcoming Samsung write-up. Didn’t realize that v1 was aimed at a limited range of inference. Will be interesting to see v2 with transformers. But current transformer structure is probably not the end of neural network evolution. We might even see networks that restructure during training into forms we’re not even thinking of today, kind of like a young human brain.
I spoke with Lamini last week - they released a paper about reducing hallucinations, enabling 95% accuracy. (The marketing said 10x fewer hallucinations; I suspect that's because a wider audience isn't comfortable with percentages?)
In order to do this they create a 1-million-way Mixture of Memory Experts using a LoRA-like method, reducing the loss function over specific domain knowledge.
Note, MoME is supposedly pronounced mommy. Do with that what you will.
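For anyone unfamiliar with the LoRA-like method mentioned: the gist is a trainable low-rank update added on top of a frozen pretrained weight. A minimal numpy sketch (shapes and rank are my illustrative choices, not from the Lamini paper):

```python
import numpy as np

d, k, r = 512, 512, 8                    # layer dims; low rank r << d
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, k))                     # trainable up-projection, zero-init

x = rng.standard_normal(d)
y = x @ (W + A @ B)                      # adapted forward pass

# With B initialized to zero, the adapter starts as an exact no-op,
# so training only ever moves the output away from the frozen baseline.
assert np.allclose(y, x @ W)
```

Storing many such (A, B) pairs is cheap relative to full weights, which is what makes a 1-million-way mixture of them plausible at all.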

Shameless plug, again: https://morethanmoore.substack.com/p/how-to-solve-llm-hallucinations here is my newsletter post on it. Alternatively https://www.lamini.ai/blog/lamini-memory-tuning is their blog post.
Clarification: I'm not paid by Lamini. I just thought this was really cool!

However, we are straying away from pure matmul math - transformers were one stage, MoME could be another. I'm seeing the new architectures focus a lot on bringing special function units into each matrix core, simply because we're still playing around with custom operators (think ReLU, but a lot more complex).
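As a toy illustration of what "matmul plus a custom operator" means: the nonlinearity runs elementwise on the matmul output, so hardware can fuse it into a special function unit sitting next to the matrix core instead of making a second pass over memory. Tanh-approximation GELU here is just a stand-in for the "more complex than ReLU" operators:

```python
import numpy as np

def fused_matmul_act(x, w, act):
    # One logical unit: matrix multiply, then elementwise custom operator.
    return act(x @ w)

def gelu(z):
    # Tanh approximation of GELU - a typical "complex" activation.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

x = np.ones((2, 4))
w = np.ones((4, 3))
out = fused_matmul_act(x, w, gelu)
print(out.shape)
```

The fusion only pays off if the function unit can keep up with the matrix core, which is exactly why these operators are becoming a hardware design question rather than a software one.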
 
I spoke with Lamini last week - they released a paper about reducing hallucinations, enabling 95% accuracy.
Interesting technology. I’m a little bit leery of their accuracy numbers for LLMs, especially with some of the newer MoE models. From what I can tell, based on the very recent focus of NVIDIA, AMD and Intel, the fastest growth for early Gen AI applications is LLM + RAG, where the RAG is used for on-prem proprietary data / resources. The good news is that there are lots of startups offering LLM+RAG platforms that have also developed innovative techniques for driving accuracy to the 95-98% zone. Most use multiple models, synthetic chats / prompts, and advanced tuning methodologies (RLHF, DPO, LoRA) to get rid of hallucinations and make responses and conversation incredibly context-aware.
 
I spoke with Lamini last week - they released a paper about reducing hallucinations,...

Shameless plug, again: https://morethanmoore.substack.com/p/how-to-solve-llm-hallucinations here is my newsletter post on it. Alternatively https://www.lamini.ai/blog/lamini-memory-tuning is their blog post.
Clarification: I'm not paid by Lamini. I just thought this was really cool!

However, we are straying away from pure matmul math - transformers were one stage, MoME could be another. I'm seeing the new architectures focus a lot on bringing special function units into each matrix core, simply because we're still playing around with custom operators (think ReLU, but a lot more complex).
It was a good MtM post.

These new functional units like MoME, or another hot one, Mixture of Depths, all use matmul on large matrices/tensors as the basic building block. I suspect Groq can figure that out as a scheduled flow too.

Groq's main problem is the bet on SRAM. Taylor's group at UW analyzed how that scales: https://arxiv.org/abs/2307.02666. On paper it's feasible, but only if you have very high throughput rates, since the minimum-size machine costs a lot and goes BRRRR...
That makes it an outlier that few people may have a use for.
 