I compared NVFP4 to common 4-bit paths (AWQ, AutoRound, bitsandbytes) on an RTX 6000 Pro. Accuracy was broadly similar; in my runs, INT4 (AWQ/AutoRound) was slightly ahead of NVFP4/NVFP4A16 on some tasks.
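To show what the two weight formats actually do, here is a minimal sketch of round-to-nearest fake quantization (quantize, then dequantize). One path handles an NVFP4-style block; the other handles a plain symmetric INT4 group. Assumptions: the NVFP4 path uses the FP4 E2M1 value grid and 16-element blocks, but it keeps the block scale as an ordinary float rather than FP8 (E4M3) and omits the per-tensor scale. The INT4 path is generic RTN, not the scale search that AWQ or AutoRound perform. All function names are hypothetical.

```python
import math
import random

# Magnitudes representable by FP4 E2M1, the element type NVFP4 uses.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def nvfp4_fake_quant(block):
    """Quantize-dequantize one block NVFP4-style (simplified, hypothetical).

    Real NVFP4 stores the block scale as FP8 E4M3 plus a per-tensor FP32
    scale; here the scale is just a Python float.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / 6.0  # map the block's largest magnitude onto E2M1's max (6)
    out = []
    for x in block:
        # Snap |x| / scale to the nearest E2M1 magnitude, then restore sign and scale.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def int4_fake_quant(group):
    """Symmetric round-to-nearest INT4 for one group (hypothetical, not AWQ)."""
    amax = max(abs(x) for x in group)
    if amax == 0.0:
        return [0.0] * len(group)
    scale = amax / 7.0
    # Clamp is defensive: with scale = amax / 7, x / scale stays within [-7, 7].
    return [max(-8, min(7, round(x / scale))) * scale for x in group]


def blockwise(values, block_size, quant_fn):
    """Apply a per-block quantizer over a flat list of weights."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quant_fn(values[i:i + block_size]))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
err_nvfp4 = mse(weights, blockwise(weights, 16, nvfp4_fake_quant))
err_int4 = mse(weights, blockwise(weights, 128, int4_fake_quant))
print(f"NVFP4-style (block 16) MSE: {err_nvfp4:.3e}")
print(f"INT4 RTN   (group 128) MSE: {err_int4:.3e}")
```

The design difference is visible in the code. NVFP4 gets a fresh scale every 16 values but spends its 16 codes on a non-uniform floating-point grid. INT4 uses 16 evenly spaced levels, and typical AWQ-style checkpoints share one scale across a larger group. Reconstruction error on random Gaussian weights is only a rough proxy for the task accuracy compared above.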

NVFP4 models were larger than typical INT4 ones (about 7 GB more for Llama 3.3), but throughput was the differentiator: with activation quantization, NVFP4 reached about 2.35x the tokens/sec of INT4 on Blackwell.
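Part of the size difference comes from per-block metadata, and a quick bits-per-weight calculation makes that visible. This is a rough sketch under stated assumptions. NVFP4 is taken as a 4-bit element plus one 8-bit scale per 16 weights. The INT4 layout is hypothetical: group size 128, one FP16 scale, and one packed 4-bit zero point per group. The count of quantized linear-layer weights for Llama 3.3 70B (about 68.4e9, excluding embeddings and `lm_head`) is also an approximation. Helper names are hypothetical.

```python
def bits_per_weight(element_bits, scale_bits_per_group, group_size):
    """Average storage cost per weight, including per-group metadata."""
    return element_bits + scale_bits_per_group / group_size


def footprint_gb(num_weights, bpw):
    """Storage in GB (1e9 bytes) for num_weights at bpw bits each."""
    return num_weights * bpw / 8 / 1e9


# NVFP4: 4-bit E2M1 element + one FP8 (E4M3) scale per 16 elements.
# The single FP32 per-tensor scale is negligible and ignored here.
NVFP4_BPW = bits_per_weight(4, 8, 16)

# Hypothetical INT4 layout: group size 128, one FP16 scale plus one packed
# 4-bit zero point per group. Real AWQ/AutoRound checkpoints may differ.
INT4_BPW = bits_per_weight(4, 16 + 4, 128)

# Assumed weight count in Llama 3.3 70B's linear layers
# (~70.6B total minus ~2.1B for embeddings and lm_head).
LINEAR_WEIGHTS = 68.4e9

nvfp4_gb = footprint_gb(LINEAR_WEIGHTS, NVFP4_BPW)
int4_gb = footprint_gb(LINEAR_WEIGHTS, INT4_BPW)
print(f"NVFP4: {NVFP4_BPW:.3f} bits/weight, ~{nvfp4_gb:.1f} GB")
print(f"INT4:  {INT4_BPW:.3f} bits/weight, ~{int4_gb:.1f} GB")
print(f"gap from scale overhead alone: ~{nvfp4_gb - int4_gb:.1f} GB")
```

Under these assumptions, scale overhead alone comes to roughly 3 GB. That explains only part of the ~7 GB gap. The rest likely depends on checkpoint details the comparison does not spell out, such as which layers each tool leaves in higher precision.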

Using NVFP4A16 (weight-only quantization) removed most of that speedup.

Practical takeaway: if you’re on Blackwell and care primarily about inference speed, NVFP4 with activation quantization is a good default. If storage is tight or you want every last bit of accuracy, INT4 remains a solid option.

NVFP4: Same Accuracy with 2.3x Higher Throughput for 4-Bit LLMs
Aug 26 at 6:50 AM