Benjamin Marie (@bnjmnmarie)

Aug 26, 2025

I compared NVFP4 to common 4-bit paths (AWQ, AutoRound, bitsandbytes) on an RTX 6000 Pro. Accuracy was broadly similar; in my runs, INT4 (AWQ/AutoRound) was slightly ahead of NVFP4/NVFP4A16 on some tasks.

NVFP4 models were larger than typical INT4 (around +7 GB for Llama 3.3), but throughput was the differentiator: with activation quantization, NVFP4 achieved about 2.35x the tokens/sec of INT4 on Blackwell.
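The post does not say where the extra size comes from. One plausible contributor is scale overhead: NVFP4 stores an FP8 scale for every block of 16 weights, while INT4 schemes usually share one scale across a much larger group. The sketch below is a rough estimate, not the author's measurement. The INT4 settings (group size 128, FP16 scale, 4-bit zero point), the 70e9 weight count, and the helper names are all assumptions made for illustration.

```python
def bits_per_weight(value_bits, group_size, scale_bits, zero_point_bits=0):
    """Payload bits plus per-group metadata amortized over the group."""
    return value_bits + (scale_bits + zero_point_bits) / group_size


def size_gb(num_weights, bpw):
    """Storage in GB (10^9 bytes) for num_weights at bpw bits each."""
    return num_weights * bpw / 8 / 1e9


# NVFP4: 4-bit values, one 8-bit (FP8) scale per block of 16 weights.
nvfp4_bpw = bits_per_weight(value_bits=4, group_size=16, scale_bits=8)

# Assumed INT4 layout: group size 128, FP16 scale, 4-bit zero point.
int4_bpw = bits_per_weight(value_bits=4, group_size=128, scale_bits=16,
                           zero_point_bits=4)

# Assumption: roughly 70e9 quantized weights for Llama 3.3.
NUM_WEIGHTS = 70e9

nvfp4_gb = size_gb(NUM_WEIGHTS, nvfp4_bpw)
int4_gb = size_gb(NUM_WEIGHTS, int4_bpw)

print(f"NVFP4: {nvfp4_bpw:.3f} bits/weight, {nvfp4_gb:.1f} GB")
print(f"INT4:  {int4_bpw:.3f} bits/weight, {int4_gb:.1f} GB")
print(f"Gap from scale overhead alone: {nvfp4_gb - int4_gb:.1f} GB")
```

Under these assumptions, scale overhead explains only part of the reported gap. The remainder would depend on details the post does not give, such as which layers each checkpoint leaves unquantized.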

Using NVFP4A16 (weights only) removed most of that speedup.

Practical takeaway: if you’re on Blackwell and care primarily about inference speed, NVFP4 with activation quantization is a good default. If storage is tight or you want every last bit of accuracy, INT4 remains a solid option.

The Kaitchup – AI on a Budget

NVFP4: Same Accuracy with 2.3x Higher Throughput for 4-Bit LLMs

Aug 26

6:50 AM
