NVIDIA Groq 3 LPX is a rack-scale, low-latency inference accelerator designed for the NVIDIA Vera Rubin platform and built for the low-latency, large-context demands of agentic systems.
Vera Rubin and LPX unite the extreme performance of Rubin GPUs and LPUs to deliver up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models.
For those of you who like specs:
Rack-Scale System Specs (NVIDIA Groq 3 LPX)
Number of accelerators: 256 interconnected Groq 3 LPU accelerators (also called LP30 chips)
AI inference compute: 315 PFLOPS (FP8)
Total on-chip SRAM capacity: 128 GB
On-chip SRAM bandwidth: 40 PB/s (petabytes per second)
Scale-up bandwidth: 640 TB/s
Scale-up density: 256 chips
Additional memory: Up to 12 TB DDR5 per rack (via fabric/host expansion for larger models)
Design: Fully liquid-cooled, cableless 1U trays (32 trays per rack, 8 LPUs per tray), built on NVIDIA MGX infrastructure
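The rack totals above can be broken down per chip. The sketch below divides each total evenly across the 32 × 8 = 256 LPUs. Even division is an assumption, because the list gives only rack totals. It uses decimal units (1 GB = 1,000 MB, 1 PB/s = 1,000 TB/s). It also adds a weights-only check of whether a model stored at 1 byte per parameter (FP8) fits in a given capacity. The helper names and the 1-byte-per-parameter footprint are illustrative assumptions that ignore KV cache, activations, and how the system actually places weights.

```python
# Rack-level totals as quoted in the spec list (decimal units).
TRAYS_PER_RACK = 32
LPUS_PER_TRAY = 8
RACK_FP8_PFLOPS = 315.0
RACK_SRAM_GB = 128.0
RACK_SRAM_BW_PBPS = 40.0
RACK_SCALE_UP_TBPS = 640.0
RACK_DDR5_TB = 12.0  # "up to" figure


def per_chip_figures() -> dict:
    """Split each rack total evenly across all LPUs.

    Even division is an assumption; the spec list only gives rack totals.
    """
    chips = TRAYS_PER_RACK * LPUS_PER_TRAY  # 32 trays x 8 LPUs
    return {
        "chips": chips,
        "fp8_tflops": RACK_FP8_PFLOPS * 1000 / chips,
        "sram_mb": RACK_SRAM_GB * 1000 / chips,
        "sram_bw_tbps": RACK_SRAM_BW_PBPS * 1000 / chips,
        "scale_up_tbps": RACK_SCALE_UP_TBPS / chips,
    }


def fp8_weight_gb(num_params: float) -> float:
    """Hypothetical helper: weight footprint at 1 byte/parameter (FP8).

    Ignores KV cache, activations, and any runtime overhead.
    """
    return num_params / 1e9


def weights_fit(num_params: float, capacity_gb: float) -> bool:
    """Weights-only capacity check; says nothing about how weights are actually placed."""
    return fp8_weight_gb(num_params) <= capacity_gb


if __name__ == "__main__":
    for name, value in per_chip_figures().items():
        print(f"{name}: {value:g}")
    one_trillion = 1e12
    print("1T params in rack SRAM:", weights_fit(one_trillion, RACK_SRAM_GB))
    print("1T params in 12 TB DDR5:", weights_fit(one_trillion, RACK_DDR5_TB * 1000))
```

Under these assumptions, each chip gets roughly 1.23 PFLOPS of FP8 compute, 500 MB of SRAM, about 156 TB/s of SRAM bandwidth, and 2.5 TB/s of scale-up bandwidth. A trillion-parameter model's FP8 weights (about 1 TB) exceed one rack's 128 GB of SRAM but fall within the 12 TB DDR5 ceiling.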
Key performance claims (when paired with Vera Rubin NVL72):
Up to 35x higher inference throughput per megawatt
Up to 10x more revenue opportunity for trillion-parameter models
Fast, predictable token generation approaching 1,000 tokens/second per user
Supports speculative decoding, long-context processing, and high-concurrency interactive inference with stable, low latency (both time-to-first-token and per-token)
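A per-user rate of 1,000 tokens/second corresponds to about 1 ms per generated token. The sketch below converts a decode rate into per-token latency and end-to-end response time. The constant decode rate, the 2,001-token reply length, and the 0.2 s time-to-first-token are hypothetical inputs, not published figures.

```python
def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time between generated tokens, assuming a constant decode rate."""
    return 1000.0 / tokens_per_second


def response_time_s(num_tokens: int, tokens_per_second: float, ttft_s: float) -> float:
    """Time until the last token arrives.

    The first token lands at ttft_s (a hypothetical input); the remaining
    tokens are assumed to stream at a constant rate.
    """
    if num_tokens <= 0:
        return 0.0
    return ttft_s + (num_tokens - 1) / tokens_per_second


if __name__ == "__main__":
    rate = 1000.0  # "approaching 1,000 tokens/second per user"
    print(f"per-token latency: {per_token_latency_ms(rate):.2f} ms")
    # Hypothetical 2,001-token agentic reply with an assumed 0.2 s time-to-first-token.
    print(f"response time: {response_time_s(2001, rate, ttft_s=0.2):.2f} s")
```

Separating time-to-first-token from the steady decode rate mirrors the two latency figures named above. For long agentic outputs, the decode term dominates.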
Mar 17 at 12:00 PM