If you're working on low-bit inference for LLMs, the new QuTLASS library is worth checking out.

It’s built on top of CUTLASS and designed specifically for 4-bit (MXFP4) inference on NVIDIA Blackwell GPUs. What makes it interesting is how it taps into microscaling, which is a new Blackwell feature that applies fine-grained scale factors along the inner dimension of GEMMs (every 32 elements, for example).
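To make the microscaling idea concrete, here is a minimal NumPy reference sketch of MXFP4-style quantization. It is not QuTLASS code and does not use its API. It assumes the usual MX layout: groups of 32 elements along the inner dimension share one power-of-two scale, and each element is rounded to the FP4 (E2M1) magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6. The function names and the floor-based scale rule are illustrative choices.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1 (the sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(x, group=32):
    """Quantize a flat array in groups that share one power-of-two scale.

    Hypothetical reference helper, not part of any library.
    Returns (q, scale): q holds FP4 grid values, scale holds one value per group.
    """
    g = np.asarray(x, dtype=np.float64).reshape(-1, group)
    amax = np.abs(g).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)      # avoid log2(0) for all-zero groups
    # The largest E2M1 exponent is 2 (4 <= 6 < 8), so shift the group's
    # exponent down by 2 to place its maximum near the top of the FP4 range.
    scale = 2.0 ** (np.floor(np.log2(safe)) - 2)
    scaled = g / scale
    # Round each magnitude to the nearest grid point. Values above 6 clamp
    # to 6, and exact ties go to the smaller magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scale

def mxfp4_dequantize(q, scale, shape):
    """Undo the scaling and restore the original shape."""
    return (q * scale).reshape(shape)
```

On real hardware the scale is stored as an 8-bit exponent and the elements as packed 4-bit codes. The sketch keeps everything in float64 so the arithmetic is easy to inspect.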

QuTLASS supports fused Hadamard transforms + quantization and online scale computation, and it comes with both CUTLASS-style and custom kernels (optimized for small batch sizes). It also supports multiple quantization modes and accepts plug-and-play rotation matrices at runtime.
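As a rough illustration of why a Hadamard rotation helps before quantization, the unfused NumPy sketch below rotates activations and weights block-wise along the inner dimension. Because the normalized Hadamard matrix is orthogonal, the matmul result is unchanged, while a single outlier gets spread across its block. The block size of 32, the helper names, and the synthetic data are assumptions for illustration. This is not how QuTLASS implements or exposes the transform.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def rotate_blocks(x, h):
    """Apply h to each contiguous block of h.shape[0] elements along the last axis."""
    n = h.shape[0]
    blocks = x.reshape(*x.shape[:-1], -1, n)
    return (blocks @ h).reshape(x.shape)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 64))   # activations, inner dimension 64
W = rng.standard_normal((8, 64))   # weights, same inner dimension
X[0, 5] = 50.0                     # inject one large outlier

H = hadamard(32)
Xr, Wr = rotate_blocks(X, H), rotate_blocks(W, H)

# Rotating both operands leaves X @ W.T unchanged, since H @ H.T = I.
same_product = np.allclose(Xr @ Wr.T, X @ W.T)
# The outlier's energy is now shared by all 32 elements of its block.
peak_before, peak_after = np.abs(X[0]).max(), np.abs(Xr[0]).max()
```

A smaller peak per block means the shared scale no longer has to stretch to cover one extreme value, so the remaining elements lose less precision when rounded to 4 bits.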

Benchmarks show up to 4x speedup over BF16 in LLM inference. Real end-to-end numbers, not just matmul.

Still early (v0.0.1), but it looks like a solid step forward for efficient 4-bit inference on modern hardware. Made by the same lab behind GPTQ and Marlin.

GitHub: 

Jul 16 at 10:02 AM