Implementing Warp Decode on a brand-new MoE model, on different hardware, outside the original lab. That’s the kind of work that moves inference from “impressive in a paper” to “actually viable at scale.”
Cursor published their Warp Decode approach last week. The core idea flips how GPU parallelism is organized when decoding Mixture-of-Experts models. In the standard layout, warps are organized around experts; in Warp Decode, each warp owns a single output neuron and streams across all routed experts. On their B200 GPUs, they report 1.84x throughput. It’s clever architecture thinking applied to a real bottleneck, and a rough sketch of the idea follows below.
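Here is a minimal CUDA sketch of that warp-per-neuron layout, assuming a simplified single-token decode with fp32 weights. The kernel name, memory layout, and routing arrays are my own placeholders for illustration, not Cursor’s implementation.

```cuda
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

// Hypothetical layout (my assumption, not Cursor's API):
//   x        [hidden_dim]                       decode-time activation for one token
//   expert_w [num_experts][out_dim][hidden_dim] per-expert projection weights
//   routed   [top_k]                            expert indices chosen by the router
//   gate     [top_k]                            router weights for those experts
//   y        [out_dim]                          combined MoE output for the token
// Launch with blockDim.x as a multiple of 32 and enough warps to cover out_dim.
__global__ void warp_decode_moe(const float* __restrict__ x,
                                const float* __restrict__ expert_w,
                                const int*   __restrict__ routed,
                                const float* __restrict__ gate,
                                float*       __restrict__ y,
                                int hidden_dim, int out_dim, int top_k) {
    // Each warp owns exactly one output neuron.
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int lane    = threadIdx.x % WARP_SIZE;
    if (warp_id >= out_dim) return;

    float acc = 0.0f;

    // Stream across every routed expert instead of parallelizing over experts.
    for (int k = 0; k < top_k; ++k) {
        int e = routed[k];
        const float* w_row =
            expert_w + ((size_t)e * out_dim + warp_id) * hidden_dim;

        // Lanes cooperatively walk the hidden dimension for this neuron's row.
        float partial = 0.0f;
        for (int i = lane; i < hidden_dim; i += WARP_SIZE)
            partial += w_row[i] * x[i];

        // Warp-level reduction of the partial dot product.
        for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
            partial += __shfl_down_sync(0xffffffff, partial, offset);

        if (lane == 0) acc += gate[k] * partial;
    }

    if (lane == 0) y[warp_id] = acc;
}
```

My reading of why this helps: at decode time the token batch is tiny, so parallelizing over experts leaves much of the GPU idle, while keying warps to output neurons keeps every warp busy and turns the router’s expert choice into a sequential loop over contiguous weight rows.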
I’m building this on Google’s new Gemma 4 (26B MoE, launched April 3rd) running on my DGX Spark’s GB10 Blackwell chip. This would be the first independent reproduction of Warp Decode outside Cursor, and on a different Blackwell SKU from the one they tested. That matters because optimizations often don’t transfer cleanly across hardware variants.
Why should you care if you’re not deep in ML infrastructure? Because inference speed is what makes AI products economically viable. Faster inference means lower cost per query, a better user experience, and a smaller GPU footprint. The difference between “we can afford to run this” and “we can’t” often comes down to optimizations like this.
Full results and writeup coming soon.