
Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform

user nl

Well-known member
Very informative paper with many details on Groq 3 LPX and its relation to (agentic) AI, plus many nice graphs that Jensen Huang also showed to some extent in the keynote:

https://developer.nvidia.com/blog/i...celerator-for-the-nvidia-vera-rubin-platform/


Introducing NVIDIA Groq 3 LPX

Vera Rubin and LPX unite the extreme performance of Rubin GPUs and LPUs to deliver up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models. Integrated with the NVIDIA MGX ETL rack architecture and aligned with the broader Vera Rubin platform, LPX gives data centers a way to deploy a dedicated low-latency inference path alongside Vera Rubin NVL72 within a common infrastructure design.

The system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators. Its architecture emphasizes deterministic execution, high on-chip SRAM bandwidth, and tightly coordinated scale-up communication so interactive inference can stay responsive even as concurrency rises and request shapes vary.

Deployed alongside Vera Rubin NVL72, LPX accelerates the latency-sensitive portions of the decode loop, including FFN and MoE expert execution, while Rubin GPUs continue to handle prefill and decode attention. Together, they deliver a heterogeneous serving path that improves interactive responsiveness without sacrificing AI factory throughput.
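The split described above (GPUs keep prefill and decode attention, LPUs take FFN/MoE expert execution during decode) can be sketched as a toy scheduling loop. This is purely illustrative: every function name here is a hypothetical placeholder, not an NVIDIA API, and the "compute" is stand-in arithmetic.

```python
# Toy sketch of the heterogeneous decode path described in the post:
# Rubin GPUs handle prefill and decode attention, while LPX's LPUs run
# the latency-sensitive FFN / MoE expert step. All names are hypothetical
# placeholders; the math is stand-in arithmetic, not real model compute.

def gpu_prefill(prompt_tokens):
    # GPU builds the KV cache from the full prompt (compute-bound phase).
    return {"kv_cache": list(prompt_tokens)}

def gpu_decode_attention(state, token):
    # GPU attends over the growing KV cache for each new token.
    state["kv_cache"].append(token)
    return sum(state["kv_cache"]) % 97   # stand-in attention output

def lpu_ffn_moe(attn_out):
    # LPU runs the FFN / MoE experts deterministically out of SRAM.
    return (attn_out * 31 + 7) % 97      # stand-in expert output

def decode(prompt_tokens, steps):
    state = gpu_prefill(prompt_tokens)           # prefill on GPU
    token = prompt_tokens[-1]
    generated = []
    for _ in range(steps):
        attn_out = gpu_decode_attention(state, token)  # GPU side
        token = lpu_ffn_moe(attn_out)                  # LPU side
        generated.append(token)
    return generated

print(decode([1, 2, 3], 4))
```

The point of the sketch is the per-token handoff: each decode step crosses from the GPU (attention over the KV cache) to the LPU (expert execution), which is why the post stresses deterministic execution and tightly coordinated scale-up communication.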

 
4GB is a lot of SRAM (is it supposed to be 4 Gb)?

The NVIDIA developer blog says it's 500 MB per chip. So that means 4 GB per unit platform (rack server), which also happens to be 4 Gb per chip anyway.
 

784 billion transistors over 8 chips makes more sense.
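The unit math in the exchange above checks out directly: 500 MB per chip over 8 chips is 4 GB per platform, 500 MB is exactly 4 Gb (gigabits) per chip, and 784 billion transistors over 8 chips is 98 billion per chip. A quick sanity check (using decimal MB/GB, as the blog figures appear to):

```python
# Sanity-checking the SRAM and transistor figures discussed above.
sram_per_chip_mb = 500            # per the NVIDIA developer blog
chips = 8                         # chips per unit platform

# 500 MB/chip * 8 chips = 4000 MB = 4 GB per platform (rack server).
sram_per_platform_gb = sram_per_chip_mb * chips / 1000
print(sram_per_platform_gb)       # 4.0 (GB per platform)

# 500 MB = 500 * 8 Mb = 4000 Mb = 4 Gb per chip -- the GB vs Gb mixup.
sram_per_chip_gbit = sram_per_chip_mb * 8 / 1000
print(sram_per_chip_gbit)         # 4.0 (Gb per chip)

# 784 billion transistors over 8 chips = 98 billion per chip.
transistors_per_chip_billion = 784 / chips
print(transistors_per_chip_billion)  # 98.0 (billion per chip)
```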
 
This article points out that the Groq solution appears to have displaced Rubin CPX (announced 2025) from the hardware stack:


As revealed by NVIDIA only back in September of 2025, Rubin CPX would be a GDDR7-backed Rubin GPU that would go into Vera Rubin NVL72 racks to handle the decode phase of token generation – the same role that Groq's LPUs are being employed for now.

When asked about the future of Rubin CPX in a press Q&A session, NVIDIA’s answer more or less discounted Rubin CPX entirely. According to company representatives, NVIDIA is focusing on integrating LPUs (and the LPX rack) into the Vera Rubin platform to optimize decode, and that is it.
 

Curiously, NVIDIA has not disclosed what the host CPU is at this time, though they have disclosed that it will have (up to) 128 GB of DRAM attached to it. Patrick looked at this photo during the GTC keynote and immediately saw that the host CPU has a retention mechanism only employed by 4th Gen Xeon, 5th Gen Xeon, and Intel Xeon 6 CPUs.


Seems to match other Intel news on the NVIDIA collaboration. I guess the remarks by Intel's CEO and CFO about a sudden sharp demand for CPUs make even more sense now?

https://semiwiki.com/forum/threads/...-cpus-in-nvidia-dgx-rubin-nvl8-systems.24755/
 
From reading Ian Cutress (smart guy), Rubin CPX is for the prefill phase and the LPU is for the decode phase. Read his article on it: https://morethanmoore.substack.com/p/nvidia-introduces-groq-lp30-and-lpx
 