I quantized LFM2.5 into several 4-bit and 8-bit variants for fast inference in vLLM. All models were tested on vLLM 0.13 with an RTX 4090.

  • FP8: Great speed with minimal accuracy loss if you’re on a recent Ada / Hopper / Blackwell GPU.

  • NVFP4: If you have a Blackwell GPU, this should be the fastest option. Expect a more noticeable accuracy drop due to 4-bit activations (full evals soon).

  • GPTQ / AWQ (4-bit): Accuracy should be strong, with broad hardware compatibility.

  • AutoRound (4-bit): Likely the best overall 4-bit option, and should work on most GPUs.
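As a rough sketch of how the list above maps to hardware, the helper below ranks the variants by CUDA compute capability. The function name and the ordering are illustrative assumptions, not part of the released models. The thresholds assume Ada is 8.9, Hopper is 9.0, and Blackwell is 10.0 or higher, and the ranking puts speed first, so adjust it if accuracy matters more.

```python
from typing import List, Tuple


def rank_variants(capability: Tuple[int, int]) -> List[str]:
    """Return quantized variants to try, in order, for a CUDA compute capability.

    Hypothetical helper: the thresholds are an assumption based on the list
    above (FP8 on Ada or newer, NVFP4 on Blackwell only).
    """
    ranked: List[str] = []
    if capability >= (10, 0):
        # Blackwell: NVFP4 should be fastest, at some cost in accuracy.
        ranked.append("NVFP4")
    if capability >= (8, 9):
        # Ada (8.9), Hopper (9.0), and Blackwell all have native FP8 support.
        ranked.append("FP8")
    # The weight-only 4-bit formats should run on most GPUs.
    ranked.extend(["AutoRound", "GPTQ / AWQ"])
    return ranked


if __name__ == "__main__":
    # An RTX 4090 is an Ada card with compute capability 8.9.
    print(rank_variants((8, 9)))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed in directly.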

I’ll run a full evaluation later this week: pass@k curves, best hyperparameters, benchmark accuracy, token efficiency, etc.
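For reference, pass@k is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021): with n samples per problem and c of them correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the full evaluation may compute it differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed, k: budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    # Probability that k samples drawn without replacement all fail,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 5 correct: pass@1 is simply the fraction correct.
    print(pass_at_k(10, 5, 1))
```

Averaging this value over all problems at each k gives one point on a pass@k curve.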

The models are here:

Jan 6 at 4:20 PM