That’s why I question folks proposing lower level standards - I can understand it for HPC problems, but not for GenAI inference. I think the big competitive battle going on right now is going to be about inference cost / power per token at the data center level, for every leading model. The good news is that Llama has been added to MLPerf 5.0. The bad news is that the focus is still on performance, so they aren’t looking at cost/power per token yet.
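To make the cost/power-per-token point concrete, here is a minimal sketch of how such a metric could be computed from a node's power draw and serving throughput. All numbers and function names are hypothetical placeholders, not figures from MLPerf or any vendor.

```python
# Hedged sketch: turning power and throughput figures into per-token metrics.
# The 10 kW / 5,000 tokens/s node below is an invented example.

def joules_per_token(system_power_w: float, tokens_per_sec: float) -> float:
    """Energy per generated token (J/token)."""
    return system_power_w / tokens_per_sec

def dollars_per_million_tokens(system_power_w: float,
                               tokens_per_sec: float,
                               usd_per_kwh: float) -> float:
    """Electricity cost per million output tokens."""
    kwh_per_token = system_power_w / tokens_per_sec / 3.6e6  # joules -> kWh
    return kwh_per_token * usd_per_kwh * 1e6

# Hypothetical 10 kW inference node serving 5,000 tokens/s at $0.10/kWh
energy = joules_per_token(10_000, 5_000)                   # 2.0 J/token
cost = dollars_per_million_tokens(10_000, 5_000, 0.10)     # ~$0.056 per Mtoken
```

A benchmark that reported these two numbers alongside raw performance would make cross-vendor comparisons at the data-center level far more meaningful.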
AMD has thrown so much hardware at the problem that it's hilarious to still lose to the H100. As for low level: you know that part of DeepSeek's success was low-level PTX assembly; they utilized the hardware fully.
Who knows? Random commissioned benchmarks are a bit meaningless, especially without full disclosure of the comparative environments. I think MLPerf 5.0 is a far more reliable, trustworthy, and transparent comparison. Unfortunately, Intel has only done the hard work for Granite Rapids, not Gaudi 3 (yet), for the new Llama LLM benchmarks.
Yeah, but most of DeepSeek's efficiency magic can be duplicated via smarter data center orchestration that does GPU planning, prefill/decode disaggregation, smart KV cache management/routing, and efficient communication between GPUs. Guess who has figured out how to make that work (hint: a 30x tokens/sec improvement on DeepSeek-R1)?
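For readers unfamiliar with prefill/decode disaggregation, here is a toy sketch of the idea: the compute-bound prompt pass (prefill) and the memory-bound generation pass (decode) run in separate stages, handing off the KV cache between them. Every class and function name here is invented for illustration; real orchestration layers are far more involved.

```python
# Hedged sketch of prefill/decode disaggregation with a KV-cache handoff.
# KVCacheStore, prefill, decode, and serve are hypothetical stand-ins.
from collections import deque

class KVCacheStore:
    """Toy KV-cache registry keyed by request id."""
    def __init__(self):
        self._cache = {}
    def put(self, req_id, kv):
        self._cache[req_id] = kv
    def get(self, req_id):
        return self._cache.pop(req_id)

def prefill(prompt):
    """Stand-in for the compute-bound prompt pass; returns fake KV state."""
    return {"tokens": prompt.split(), "layers": 32}

def decode(kv, max_new_tokens=3):
    """Stand-in for the memory-bound token-by-token generation pass."""
    return ["tok%d" % i for i in range(max_new_tokens)]

def serve(requests):
    store = KVCacheStore()
    decode_queue = deque()
    # Phase 1: prefill workers process prompts and publish KV caches
    for req_id, prompt in requests:
        store.put(req_id, prefill(prompt))
        decode_queue.append(req_id)
    # Phase 2: decode workers pull KV caches and generate tokens
    results = {}
    for req_id in decode_queue:
        results[req_id] = decode(store.get(req_id))
    return results

out = serve([(1, "hello world"), (2, "semiwiki forum post")])
```

The payoff of splitting the two phases is that each worker pool can be sized and scheduled for its own bottleneck (FLOPs for prefill, memory bandwidth for decode), which is where much of the claimed throughput gain comes from.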
Maybe latency is not the right word. The throughput of CPU inference is limited compared to accelerators, but a CPU can produce the same outputs given enough time.