I think this data center AI inference benchmark update highlights what's happening: data center / model co-optimization. Can't wait for system-level additions from Amazon, Google (TPU), and Cerebras.
1. What InferenceX v2 adds
• Benchmarks now cover disaggregated prefill + wide expert parallelism (wideEP) for DeepSeek‑style MoE across essentially all recent NVIDIA and AMD SKUs, including GB300 NVL72, B300, B200, H100, and MI355X.
• It is the first third‑party suite to fully benchmark GB300/GB200 NVL72 and multi‑node FP4/FP8 MI355X in these production‑like configurations.
2. NVIDIA’s position (Hopper → Blackwell)
• For state‑of‑the‑art stacks (disagg prefill + wideEP + FP4), NVIDIA “framemogs” (i.e., dominates) with B200/B300 and especially rack‑scale GB200/GB300 NVL72, both on SGLang and TensorRT‑LLM (TRT‑LLM).
• Blackwell NVL72 achieves up to roughly 100× throughput vs a strong H100 disagg+wideEP FP8 baseline at high interactivity, and 9.7–65× better tokens‑per‑dollar vs Hopper depending on latency.
• Wide EP across a 72‑GPU NVLink domain massively improves weight‑loading efficiency and amortization for huge MoEs like DeepSeek R1; TP+EP hybrids are used to balance load at different concurrencies (a rough amortization sketch follows this section's bullets).
• GB200 NVL72 and GB300 NVL72 get their advantage from very high intra‑rack NVLink bandwidth; at very high interactivity (tiny batches) the benefit shrinks because workloads stay within a single 8‑GPU node.
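A back-of-the-envelope model of that amortization effect (mine, not the article's; the expert count, top-k, layer count, and weight sizes below are assumed, roughly DeepSeek-R1-like):

```python
# Rough model of why a wide expert-parallel domain amortizes MoE weight loads
# better than a single 8-GPU node. All shapes are assumptions, not report data.

E_TOTAL = 256                        # routed experts per MoE layer (assumed)
TOP_K = 8                            # experts activated per token (assumed)
MOE_LAYERS = 58                      # MoE layers in the model (assumed)
BYTES_PER_EXPERT = 3 * 7168 * 2048   # gate/up/down FP8 matrices, ~44 MB (assumed)

def hbm_weight_bytes_per_token(batch_tokens: int) -> float:
    """Expected MoE expert-weight bytes streamed from HBM per generated token,
    for one copy of the expert weights, assuming uniform routing."""
    p_expert_idle = (1 - TOP_K / E_TOTAL) ** batch_tokens
    distinct_experts_hit = E_TOTAL * (1 - p_expert_idle)
    per_layer_bytes = distinct_experts_hit * BYTES_PER_EXPERT
    return MOE_LAYERS * per_layer_bytes / batch_tokens

# One 8-GPU node serving a decode batch of 64 tokens vs a 72-GPU wideEP domain
# serving 9x the batch (same per-GPU load, but only one copy of the experts).
for label, batch in [("8-GPU node, batch 64", 64), ("72-GPU wideEP, batch 576", 576)]:
    gb = hbm_weight_bytes_per_token(batch) / 1e9
    print(f"{label}: ~{gb:.1f} GB of expert weights read per token")
```

Under these assumptions the wide domain streams roughly 8× fewer expert-weight bytes per generated token, which is the amortization the bullet above describes; real routing is non-uniform, so treat the numbers as illustrative only.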
3. AMD’s position and “composability” problem
• On FP8 disaggregated inference with SGLang, MI355X (with MoRI) is competitive with B200 SGLang disagg across parts of the Pareto curve, especially at mid interactivities.
• Single‑node FP4 on AMD is decent, and MI355X can be up to ~10× faster than MI300X with much better cost for DeepSeek R1 FP8.
• But when you compose FP4 + disagg prefill + wideEP—the configuration top labs actually use—MI355X falls apart: FP4 disagg+wideEP perf is subpar and gets “mogged” by B200, especially once you include Dynamo+TRT‑LLM.
• Root cause called out: poor composability of AMD’s optimizations and lagging ROCm/vLLM/SGLang ecosystem (old vLLM forks, limited CI hardware, fewer upstream contributors and reviewers).
4. Disaggregated frameworks and KV
• NVIDIA: uses Dynamo as its disaggregated inference framework, with prefill/decode separation, routing, and KV cache offload; it is engine‑agnostic and supports SGLang and TRT‑LLM backends (a toy sketch of the prefill/decode split follows this section's bullets).
• AMD: uses SGLang plus MoRI and Mooncake; MoRI handles MoE routing and KV transfer, while Mooncake integrates with PyTorch for prefill/decode (PD) disaggregation and fault‑tolerant multi‑node serving, but both are still maturing.
• GB200 Dynamo TRT‑LLM disagg shows ~20%+ throughput gains over a month as wideEP kernels and stack mature, highlighting NVIDIA’s fast software iteration.
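A toy sketch of the prefill/decode split these frameworks implement. This is not Dynamo's, SGLang's, or Mooncake's real API; the class and field names are made up for illustration:

```python
# Toy illustration of disaggregated serving: a prefill pool builds the prompt's
# KV cache, hands it off, and a decode pool streams tokens from that cache.
from dataclasses import dataclass, field

@dataclass
class Request:
    request_id: str
    prompt_tokens: list[int]
    kv_cache: bytes = b""                              # stand-in for real paged KV blocks
    output_tokens: list[int] = field(default_factory=list)

class PrefillWorker:
    def run(self, req: Request) -> Request:
        # A real engine runs the full forward pass here; we fake a KV blob whose
        # size scales with prompt length, as the transferred state does.
        req.kv_cache = bytes(len(req.prompt_tokens) * 4)
        return req

class DecodeWorker:
    def run(self, req: Request, max_new_tokens: int) -> Request:
        assert req.kv_cache, "KV cache must be transferred before decode"
        for step in range(max_new_tokens):
            req.output_tokens.append(step)             # placeholder for sampled token ids
        return req

class DisaggRouter:
    """Routes each request prefill -> KV transfer -> decode, so the two phases
    can be batched and scaled independently."""
    def __init__(self, prefill: PrefillWorker, decode: DecodeWorker):
        self.prefill, self.decode = prefill, decode

    def serve(self, req: Request, max_new_tokens: int = 8) -> Request:
        req = self.prefill.run(req)                    # compute-bound phase
        # In production the KV blocks move over NVLink/RDMA via a transfer layer
        # (Mooncake on the AMD stack per the article); here it is an in-process handoff.
        return self.decode.run(req, max_new_tokens)    # bandwidth-bound phase

router = DisaggRouter(PrefillWorker(), DecodeWorker())
out = router.serve(Request("r1", prompt_tokens=list(range(128))))
print(len(out.kv_cache), out.output_tokens)
```

The point of the split is that prefill (compute-bound) and decode (memory-bandwidth-bound) can be placed on differently sized pools and batched on different schedules, which is what the Pareto-curve comparisons above are measuring.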
5. Economics and provider margins
• Using InferenceX metrics and OpenRouter prices for DeepSeek R1, SemiAnalysis estimates that high‑end providers (e.g., Crusoe on H200‑class hardware) can achieve very large gross margins (tens of percent to >80%) at realistic interactivity (~35 tok/s/user) when using MTP + disagg + wideEP (the arithmetic is sketched after this list).
• At high interactivity (e.g., 125–167 tok/s/user), multi‑token prediction becomes essential to keep cost per token reasonable, and Blackwell configs with MTP are consistently cheapest.
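The margin estimate boils down to tokens-per-dollar arithmetic. A minimal sketch with placeholder inputs; none of the numbers below come from the report or from OpenRouter:

```python
# Illustrative arithmetic only; throughput, GPU-hour cost, and token price are
# placeholder assumptions, not InferenceX or OpenRouter figures.

GPU_HOUR_COST_USD = 2.50        # assumed all-in cost per GPU-hour
GPUS_IN_SERVING_UNIT = 8        # assumed serving replica size
OUTPUT_TOKENS_PER_SEC = 12_000  # assumed aggregate decode throughput at ~35 tok/s/user
PRICE_PER_MTOK_USD = 2.00       # assumed price per million output tokens

tokens_per_hour = OUTPUT_TOKENS_PER_SEC * 3600
cost_per_mtok = (GPU_HOUR_COST_USD * GPUS_IN_SERVING_UNIT) / (tokens_per_hour / 1e6)
gross_margin = 1 - cost_per_mtok / PRICE_PER_MTOK_USD

print(f"cost per million output tokens: ${cost_per_mtok:.2f}")
print(f"gross margin at ${PRICE_PER_MTOK_USD:.2f}/Mtok: {gross_margin:.0%}")
```

With these placeholder inputs the margin lands near 77%, inside the "tens of percent to >80%" range quoted above, but the real estimate depends entirely on measured throughput and actual prices.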
6. Meta: Blackwell Ultra, AMD ATOM, and roadmap
• Blackwell Ultra already shows up to 1.5× FP8 gains over Blackwell in practice, though FP4 gains are smaller so far, likely due to immature software.
• AMD’s new ATOM engine slightly improves single‑node perf but lacks critical features (NVMe/CPU offload, tooling, wideEP, disagg), so it sees essentially no production use compared with TRT‑LLM.
• AMD is at least six months behind on open‑source distributed inference and wideEP FP4 composability, but MI455X rack‑scale systems are due in low volume in H2 2026, with real production not until 2027.
Net: the update reinforces that “real” frontier inference is now defined by disaggregated prefill, wide MoE expert parallelism, and FP4/FP8 stacks; Blackwell NVL72 plus Dynamo/TRT‑LLM is in the lead there, while AMD has proved hardware potential (especially MI355X FP8) but must fix software composability and upstream ecosystem engagement to close the gap.
newsletter.semianalysis.com
InferenceX v2: NVIDIA Blackwell Vs AMD vs Hopper - Formerly InferenceMAX
The Artist Known as InferenceMAX. GB300 NVL72, MI355X, B200, H100, Disaggregated Serving, Wide Expert Parallelism, Large Mixture of Experts, SGLang, vLLM, TRTLLM
