Benjamin Marie (@bnjmnmarie)

Jul 16, 2025

If you're working on low-bit inference for LLMs, the new QuTLASS library is worth checking out.

It’s built on top of CUTLASS and designed specifically for 4-bit (MXFP4) inference on NVIDIA Blackwell GPUs. What makes it interesting is how it taps into microscaling, a new Blackwell feature that applies fine-grained scale factors along the inner dimension of GEMMs (every 32 elements, for example).
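To make the "one scale factor every 32 elements" idea concrete, here is a minimal pure-Python sketch of MXFP4-style block quantization. It is not QuTLASS code: the function names are hypothetical, and it assumes the usual microscaling layout of 4-bit E2M1 elements sharing one power-of-two scale per block of 32, with the shared exponent taken as floor(log2(max |x|)) minus the E2M1 maximum exponent (2).

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32       # elements sharing one scale factor
E2M1_EMAX = 2    # largest E2M1 exponent: 6.0 = 1.5 * 2**2


def quantize_block_mxfp4(block):
    """Return (shared_exponent, fp4_values) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale chosen so the block maximum lands in the FP4 range.
    # (Clamping the exponent to the 8-bit scale range is omitted here.)
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    values = []
    for v in block:
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        # Nearest grid point; ties go to the smaller magnitude (a simplification).
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        values.append(math.copysign(nearest, v))
    return shared_exp, values


def quantize_mxfp4(row):
    """Quantize a row (the GEMM inner dimension) in blocks of 32."""
    assert len(row) % BLOCK == 0, "row length must be a multiple of 32"
    return [quantize_block_mxfp4(row[i:i + BLOCK])
            for i in range(0, len(row), BLOCK)]


def dequantize_mxfp4(blocks):
    """Expand (exponent, values) pairs back to floats."""
    return [v * 2.0 ** exp for exp, values in blocks for v in values]


if __name__ == "__main__":
    row = [1.0] * 32 + [100.0] * 32
    blocks = quantize_mxfp4(row)
    print([exp for exp, _ in blocks])   # one shared exponent per block
    print(dequantize_mxfp4(blocks)[::32])
```

Because each block of 32 gets its own scale, the large values in the second block do not cost the first block any precision, which is the point of scaling at this granularity.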

QuTLASS supports fused Hadamard transform + quantization and online scale computation, and it comes with both CUTLASS-style and custom kernels (optimized for small batch sizes). It supports multiple quantization modes, and rotation matrices are plug-and-play at runtime.
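Why rotate before quantizing? A rough sketch, again in plain Python and not taken from QuTLASS: an orthonormal Hadamard rotation spreads a single outlier across the whole block, so the block's shared scale is no longer dominated by one value, and rotating both GEMM operands the same way leaves their dot product unchanged. The helper names and the block size of 32 are assumptions for illustration; in a fused kernel the quantization step would follow the rotation directly.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * norm for v in out]


def rotate_blocks(row, block=32):
    """Apply the Hadamard rotation independently to each block of the row."""
    assert len(row) % block == 0, "row length must be a multiple of the block size"
    rotated = []
    for i in range(0, len(row), block):
        rotated.extend(fwht(row[i:i + block]))
    return rotated


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


if __name__ == "__main__":
    # Activation block with one large outlier, and an arbitrary weight block.
    x = [0.0] * 32
    x[5] = 32.0
    w = [float(i % 7 - 3) for i in range(32)]

    xr, wr = rotate_blocks(x), rotate_blocks(w)
    print(max(abs(v) for v in x), max(abs(v) for v in xr))  # outlier is spread out
    print(dot(x, w), dot(xr, wr))  # equal up to floating-point rounding
```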

Benchmarks show up to 4x speedup over BF16 in LLM inference. Real end-to-end numbers, not just matmul.

Still early (v0.0.1), but it looks like a solid step forward for efficient 4-bit inference on modern hardware. Made by the same lab behind GPTQ and Marlin.

GitHub: github.com/IST-DASLab/qutlass (QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning)

Jul 16, 10:02 AM
