
Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform

user nl

Well-known member
Very informative paper with many details on Groq 3 LPX and its relation to (agentic) AI, plus many nice graphs that Jensen Huang also showed to some extent in the keynote:

https://developer.nvidia.com/blog/i...celerator-for-the-nvidia-vera-rubin-platform/


Introducing NVIDIA Groq 3 LPX

Vera Rubin and LPX unite the extreme performance of Rubin GPUs and LPUs to deliver up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models. Integrated with the NVIDIA MGX ETL rack architecture and aligned with the broader Vera Rubin platform, LPX gives data centers a way to deploy a dedicated low-latency inference path alongside Vera Rubin NVL72 within a common infrastructure design.

The system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators. Its architecture emphasizes deterministic execution, high on-chip SRAM bandwidth, and tightly coordinated scale-up communication so interactive inference can stay responsive even as concurrency rises and request shapes vary.

Deployed alongside Vera Rubin NVL72, LPX accelerates the latency-sensitive portions of the decode loop, including FFN and MoE expert execution, while Rubin GPUs continue to handle prefill and decode attention. Together, they deliver a heterogeneous serving path that improves interactive responsiveness without sacrificing AI factory throughput.
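The split described above (GPUs keep prefill and decode attention, LPUs take FFN/MoE expert execution during decode) can be sketched as a toy scheduling loop. This is purely illustrative: every function name here is a hypothetical placeholder, not an NVIDIA API, and the "compute" is stand-in arithmetic.

```python
# Toy sketch of the heterogeneous decode path described in the post:
# Rubin GPUs handle prefill and decode attention, while LPX's LPUs run
# the latency-sensitive FFN / MoE expert step. All names are hypothetical
# placeholders; the math is stand-in arithmetic, not real model compute.

def gpu_prefill(prompt_tokens):
    # GPU builds the KV cache from the full prompt (compute-bound phase).
    return {"kv_cache": list(prompt_tokens)}

def gpu_decode_attention(state, token):
    # GPU attends over the growing KV cache for each new token.
    state["kv_cache"].append(token)
    return sum(state["kv_cache"]) % 97   # stand-in attention output

def lpu_ffn_moe(attn_out):
    # LPU runs the FFN / MoE experts deterministically out of SRAM.
    return (attn_out * 31 + 7) % 97      # stand-in expert output

def decode(prompt_tokens, steps):
    state = gpu_prefill(prompt_tokens)           # prefill on GPU
    token = prompt_tokens[-1]
    generated = []
    for _ in range(steps):
        attn_out = gpu_decode_attention(state, token)  # GPU side
        token = lpu_ffn_moe(attn_out)                  # LPU side
        generated.append(token)
    return generated

print(decode([1, 2, 3], 4))
```

The point of the sketch is the per-token handoff: each decode step crosses from the GPU (attention over the KV cache) to the LPU (expert execution), which is why the post stresses deterministic execution and tightly coordinated scale-up communication.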

 
4GB is a lot of SRAM (is it supposed to be 4 Gb)?

The NVIDIA developer blog says it's 500 MB per chip. So that means 4 GB per unit platform (rack server), which also happens to be 4 Gb per chip anyway.
 

784 billion transistors over 8 chips makes more sense.
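The unit math in the exchange above checks out directly: 500 MB per chip over 8 chips is 4 GB per platform, 500 MB is exactly 4 Gb (gigabits) per chip, and 784 billion transistors over 8 chips is 98 billion per chip. A quick sanity check (using decimal MB/GB, as the blog figures appear to):

```python
# Sanity-checking the SRAM and transistor figures discussed above.
sram_per_chip_mb = 500            # per the NVIDIA developer blog
chips = 8                         # chips per unit platform

# 500 MB/chip * 8 chips = 4000 MB = 4 GB per platform (rack server).
sram_per_platform_gb = sram_per_chip_mb * chips / 1000
print(sram_per_platform_gb)       # 4.0 (GB per platform)

# 500 MB = 500 * 8 Mb = 4000 Mb = 4 Gb per chip -- the GB vs Gb mixup.
sram_per_chip_gbit = sram_per_chip_mb * 8 / 1000
print(sram_per_chip_gbit)         # 4.0 (Gb per chip)

# 784 billion transistors over 8 chips = 98 billion per chip.
transistors_per_chip_billion = 784 / chips
print(transistors_per_chip_billion)  # 98.0 (billion per chip)
```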
 
This article points out that the Groq solution appears to have displaced Rubin CPX (announced 2025) from the hardware stack:


As revealed by NVIDIA only back in September of 2025, Rubin CPX would be a GDDR7-backed Rubin GPU that would go into Vera Rubin NVL72 racks to handle the decode phase of token generation – the same role that Groq's LPUs are being employed for now.

When asked about the future of Rubin CPX in a press Q&A session, NVIDIA’s answer more or less discounted Rubin CPX entirely. According to company representatives, NVIDIA is focusing on integrating LPUs (and the LPX rack) into the Vera Rubin platform to optimize decode, and that is it.
 

Curiously, NVIDIA has not disclosed what the host CPU is at this time, though they have disclosed that it will have (up to) 128 GB of DRAM attached to it. Patrick looked at this photo during the GTC keynote and immediately saw that the host CPU has a retention mechanism only employed by 4th Gen Xeon, 5th Gen Xeon, and Intel Xeon 6 CPUs.


Seems to match other Intel news on the NVIDIA collaboration. I guess the remarks by Intel's CEO and CFO about a sudden sharp demand for CPUs make even more sense now?

https://semiwiki.com/forum/threads/...-cpus-in-nvidia-dgx-rubin-nvl8-systems.24755/
 
From reading Ian Cutress (smart guy), Rubin CPX is for the prefill phase and the LPU is for the decode phase. Read his article on it: https://morethanmoore.substack.com/p/nvidia-introduces-groq-lp30-and-lpx
 