The Most Expensive Question in AI Hardware Is the One Nobody Asks

Daniel Nenni · 2026-06-13T23:53:48-0700

Anshul Saxena

Product Strategy, Planning & Portfolio Management | Ecosystem Engagement | Engineering

June 11, 2026

Every silicon product definition I've been part of in the AI/HPC industry eventually arrives at the same slide: the process node decision. And every time, the room treats it like a foregone conclusion. Of course, we want the leading edge. Of course, smaller is better. Nobody gets fired for choosing fewer nanometers or angstroms.

That reasoning may have held for the last couple of decades, but I have come to believe it is now the most expensive unexamined assumption in AI infrastructure, especially for the people building the next wave of it: the AI gigafactories serving inference and Mixture-of-Experts workloads at scale, without a hyperscaler's balance sheet.

Let me make the case with a story the industry just told about itself.

The story starts with a benchmark that gave the game away. In October 2025, SemiAnalysis launched InferenceMAX, an open-source benchmark that does something refreshingly honest. Instead of quoting peak FLOPS, it runs nightly tests across leading chips and measures what operators actually pay for: tokens per second, tokens per watt, and cost per million tokens, across real workloads and real latency targets.

The headline result was NVIDIA's victory lap. Blackwell delivered up to 15x the inference performance of the previous Hopper generation. Independent framework teams confirmed the magnitude in more conservative terms: roughly 4x throughput at similar latency on Llama 3.3 70B, and 4x on DeepSeek-R1 — a Mixture-of-Experts model — consistent across the entire latency-throughput curve. For power-limited AI factories, up to 10x more tokens per megawatt.

Extraordinary. Now here's the part that should reframe every roadmap conversation: Blackwell and Hopper sit on the same process node class. Both are TSMC 4nm-class silicon: Hopper on the custom 4N process, Blackwell on its 4NP refinement.

Look at the spec sheets and the plot thickens further. An H100 peaks at about 4 petaFLOPS of FP8 (the vendor figure with sparsity). A B200 delivers about 9 on the same basis. Call it 2x, roughly what you'd expect from gluing two dies together. So where did 4x to 15x come from? From everywhere except lithography: a new FP4 number format that halves the bytes moved per parameter relative to FP8. Memory bandwidth jumping from 3.35 to 8 terabytes per second. Per-GPU NVLink bandwidth doubling from 900 GB/s to 1.8 TB/s. Disaggregated serving that splits prefill from decode. Expert-parallel routing built for MoE. And months of kernel-level software work across TensorRT-LLM, vLLM, and SGLang that keeps improving the same silicon week after week.
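To make the spec-sheet comparison concrete, here is a minimal Python sketch that computes the generation-over-generation ratios from the figures quoted above. The H100 NVLink value is inferred from the word "doubling", and the last line assumes a bandwidth-bound decode with an FP8 baseline; both are assumptions of this sketch, and none of it models the benchmark itself.

```python
# Peak figures as quoted in the paragraph above (vendor peaks, not measured throughput).
# The H100 NVLink value is inferred from "doubling to 1.8 TB/s".
SPECS = {
    "H100": {"fp8_pflops": 4.0, "hbm_tb_s": 3.35, "nvlink_tb_s": 0.9},
    "B200": {"fp8_pflops": 9.0, "hbm_tb_s": 8.0, "nvlink_tb_s": 1.8},
}


def generation_ratios(old, new):
    """Return new/old for every spec the two chips share."""
    return {key: new[key] / old[key] for key in old if key in new}


ratios = generation_ratios(SPECS["H100"], SPECS["B200"])

# Assumption: decode is bandwidth-bound and the baseline stores weights in FP8,
# so moving to FP4 halves the bytes streamed per parameter.
FP4_BYTES_SAVING = 2.0
weights_per_second_gain = ratios["hbm_tb_s"] * FP4_BYTES_SAVING

for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
print(f"bandwidth x FP4: {weights_per_second_gain:.2f}x")
```

Under these assumptions the raw compute ratio is about 2.25x, while memory bandwidth combined with the narrower format comes to roughly 4.8x. That is a back-of-envelope product, not an explanation of any specific benchmark result, but it shows which levers carry the larger multipliers.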

The most successful chip company in history just demonstrated an order-of-magnitude gain — specifically in MoE inference — without shrinking a single transistor. That's not a footnote. That's the strategy.

And this matters even more if you're not Google. Here's the uncomfortable math that rarely makes it onto the slide. A 2nm wafer will cost roughly twice an N4-class wafer. But the wafer isn't even the real problem — the design is. Taking a chip to the bleeding edge means hundreds of millions in IP, EDA, masks, and tape-outs. NVIDIA amortizes that across an empire. A company building silicon for medium-scale AI infrastructure does not. At realistic volumes, the amortized design cost of a 2nm chip can exceed the cost of the silicon itself.
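The amortization argument can be sketched in a few lines of Python. Every number below is a hypothetical placeholder chosen only to illustrate the shape of the math: the wafer price, the good dies per wafer, and the design cost are not quoted figures, and the function names are invented for this example.

```python
# All inputs are hypothetical placeholders, not quoted prices.
WAFER_COST_USD = 30_000        # assumed leading-edge wafer price
GOOD_DIES_PER_WAFER = 40       # assumed yielded dies for a large AI chip
NRE_USD = 500_000_000          # assumed design cost: IP, EDA, masks, tape-outs


def silicon_cost_per_chip(wafer_cost, good_dies):
    """Wafer cost spread over the dies that actually work."""
    return wafer_cost / good_dies


def design_cost_per_chip(nre, volume):
    """One-time design cost spread over lifetime unit volume."""
    return nre / volume


def breakeven_volume(wafer_cost, good_dies, nre):
    """Volume below which amortized design cost exceeds silicon cost per chip."""
    return nre / silicon_cost_per_chip(wafer_cost, good_dies)


silicon = silicon_cost_per_chip(WAFER_COST_USD, GOOD_DIES_PER_WAFER)
breakeven = breakeven_volume(WAFER_COST_USD, GOOD_DIES_PER_WAFER, NRE_USD)

for volume in (100_000, 1_000_000, 10_000_000):
    design = design_cost_per_chip(NRE_USD, volume)
    print(f"{volume:>10,} units: silicon ${silicon:,.0f}, design ${design:,.0f}")
print(f"break-even volume: {breakeven:,.0f} units")
```

With these placeholder inputs, design cost per chip stays above silicon cost until lifetime volume passes roughly 670,000 units. Different inputs move the break-even point, but the structure is the same: the fixed cost dominates at modest volume.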

And what would that premium actually buy for the workloads gigafactories run? Inference is memory-bound: the chip spends its life streaming weights and KV cache, not doing arithmetic. MoE makes this more extreme, not less — enormous parameter counts held in memory, only a fraction activated per token, with the hard problems living in capacity, bandwidth, and the interconnect that routes tokens between experts. A 2nm compute tile accelerates the part that wasn't the bottleneck. It's paying a premium to make your traffic jam's fastest car faster.
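A roofline-style estimate shows why faster arithmetic alone does little for a memory-bound decode step. The sketch below is a simplification under stated assumptions: a hypothetical dense 70B-parameter model with FP8 weights on a single chip, the H100 peak figures quoted earlier, about 2 FLOPs per parameter per token, and no KV cache or interconnect traffic. An MoE model would change the details, not the direction.

```python
def decode_step_seconds(params, bytes_per_param, hbm_bytes_per_s, peak_flops, batch=1):
    """Roofline-style estimate for one decode step of a dense model.

    Weights are streamed once per step regardless of batch size; arithmetic
    is about 2 FLOPs per parameter per token. KV cache traffic is ignored.
    Returns (step time, memory time, compute time) in seconds.
    """
    memory_s = params * bytes_per_param / hbm_bytes_per_s
    compute_s = 2 * params * batch / peak_flops
    return max(memory_s, compute_s), memory_s, compute_s


PARAMS = 70e9          # hypothetical dense 70B-parameter model
BYTES_PER_PARAM = 1    # FP8 weights
HBM = 3.35e12          # bytes/s, H100 figure quoted earlier
PEAK = 4e15            # FLOP/s, H100 FP8 peak quoted earlier

base, mem_s, cmp_s = decode_step_seconds(PARAMS, BYTES_PER_PARAM, HBM, PEAK)
faster_compute, _, _ = decode_step_seconds(PARAMS, BYTES_PER_PARAM, HBM, 2 * PEAK)
faster_memory, _, _ = decode_step_seconds(PARAMS, BYTES_PER_PARAM, 2 * HBM, PEAK)

print(f"memory time / compute time at batch 1: {mem_s / cmp_s:.0f}x")
print(f"baseline step:      {base * 1e3:.2f} ms")
print(f"2x compute:         {faster_compute * 1e3:.2f} ms")
print(f"2x memory bandwidth: {faster_memory * 1e3:.2f} ms")
```

In this toy model, doubling peak compute leaves the step time unchanged, while doubling memory bandwidth halves it. Batching narrows the gap, since the same weights serve more tokens per step, but the estimate illustrates why capacity and bandwidth are where the premium would need to land.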

The rational play — the one the whole industry is quietly converging on — is to spend advanced silicon only where it earns its keep. Chiplets: compute tiles on the most advanced node that volume justifies, I/O and analog on cheap mature processes where shrinks barely help anyway, everything bound by advanced packaging. It converts a bet-the-company node decision into a portfolio. Lower design cost, lower risk, and the option to upgrade one tile to the next generation instead of redesigning the system.

There is, however, a power twist to this story, because AI gigafactories live and die by the megawatt. Grid connections are the scarcest resource in AI infrastructure today; electricity is a first-order operating cost rather than a footnote, and over a deployment's lifetime, energy can rival the hardware bill itself. Doesn't that argue for bleeding-edge efficiency?

It argues for efficiency. It doesn't argue for buying it at 2nm prices. Blackwell's 10x tokens-per-megawatt gain came from format, memory, interconnect, and software — the cheap levers. A medium-scale operator will exhaust their capital long before they exhaust those levers. The leading edge is what you reach for after the affordable efficiency is gone, and for inference, we are nowhere close.

So here is what I'd tell the roadmap room. If your product is designed for training frontier models against a hard power wall, keep buying the bleeding edge — that's rational, and it's also not most of us. For the fast-growing middle of this market, the priority stack is clear, and InferenceMAX just published the recipe:

Software and utilization first. Memory and interconnect second. Packaging and chiplets third. Nanometers last.

The next generation of AI infrastructure won't be won by whoever has the smallest transistors. It will be won by whoever wastes the fewest of them.