Benjamin Marie (@bnjmnmarie):

I quantized LFM2.5 into several 4-bit and 8-bit variants for fast inference in vLLM. All models were tested on vLLM 0.13 with an RTX 4090.

FP8: Great speed with minimal accuracy loss if you’re on a recent Ada / Hopper / Blackwell GPU.
NVFP4: If you have a Blackwell GPU, this should be the fastest option. Expect a more noticeable accuracy drop due to 4-bit activations (full evals soon).
GPTQ / AWQ (4-bit): Strong expected accuracy and broad hardware compatibility.
AutoRound (4-bit): Likely the best overall 4-bit, and should work on most GPUs.
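
As a rough illustration, the guidance above can be turned into a small lookup keyed on CUDA compute capability. This is only a sketch: the function name `recommend_quantization`, the `prefer_speed` flag, and the capability thresholds (8.9 for Ada, 9.0 for Hopper, 10.0 and up for Blackwell) are assumptions added for the example, not something the post specifies. The AutoRound fallback reflects the "likely the best overall 4-bit" remark rather than measured results.

```python
def recommend_quantization(major: int, minor: int, prefer_speed: bool = True) -> str:
    """Suggest a variant for a GPU with compute capability (major, minor).

    Thresholds are assumptions for this sketch:
    Ada = 8.9, Hopper = 9.0, Blackwell = 10.0 and above.
    """
    capability = (major, minor)

    # Blackwell: NVFP4 should be fastest, at the cost of a larger accuracy drop.
    if capability >= (10, 0):
        return "NVFP4" if prefer_speed else "FP8"

    # Ada and Hopper: FP8 gives good speed with minimal accuracy loss.
    if capability >= (8, 9):
        return "FP8"

    # Older GPUs: fall back to a 4-bit weight-only format.
    return "AutoRound"


if __name__ == "__main__":
    # An RTX 4090 (Ada) reports compute capability 8.9.
    print(recommend_quantization(8, 9))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair that this function expects.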

I’ll run a full evaluation later this week: pass@k curves, best hyperparameters, benchmark accuracy, token efficiency, etc.
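
For readers unfamiliar with pass@k, here is a minimal sketch of how one point on such a curve is often computed, assuming the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The post does not say which estimator will be used, so treat `pass_at_k` and its parameters as illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c out of n generated samples passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # A pass@k curve for one problem with 10 samples, 3 of which passed.
    for k in range(1, 11):
        print(k, round(pass_at_k(10, 3, k), 4))
```

Averaging this value over all problems in a benchmark, for each k, gives the curve.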

The models are here:

huggingface.co

Verified models. Compatible with vLLM.

Jan 6

4:20 PM