Sebastian Raschka, PhD (@rasbt)

A good week for open-weight LLMs. A bunch of nice models that run locally (even Gemma 4 12B needs less than 16 GB of RAM).

Ok, Nemotron 3 Ultra is a bit too large to run locally, but it comes with a seriously impressive performance-to-efficiency ratio, and it will be a hard open-weight model to beat in this respect for a long time.

At a high level, Nemotron 3 Ultra is the large sibling of Nemotron 3 Super. The overall design is still a hybrid Mamba-Transformer MoE stack, but scaled to 550B total parameters with 55B active per token.
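To put the two parameter counts in perspective, here is a quick back-of-the-envelope check. It only uses the 550B and 55B figures quoted above; the function and variable names are made up for illustration.

```python
# Figures quoted in the post, in billions of parameters.
TOTAL_PARAMS_B = 550
ACTIVE_PARAMS_B = 55


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active per token: {fraction:.0%} of all parameters")
```

So each token touches about one in ten of the model's parameters.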

The part I find most interesting is the Latent MoE idea (introduced in Nemotron 3 Super).

In a regular MoE layer, the routed experts operate directly at the model width. In Latent MoE, the routed path is first projected down into a smaller latent space, the experts operate there, and the result is projected back up.

For Super, this was 4096 -> 1024 -> 4096.

For Ultra, it is 8192 -> 2048 -> 8192.

So the 4x compression ratio stays the same, but the model is scaled up substantially.
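Here is a minimal sketch of the down-project, route, up-project flow in plain Python. It is a toy under stated assumptions, not NVIDIA's implementation: the widths are scaled down to 16 -> 4 -> 16 (keeping the 4x ratio), and the expert count, top-k value, ReLU experts, softmax gating, and a router that reads the full-width input are all placeholders I picked for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def latent_moe_forward(x, w_down, w_router, experts, w_up, top_k):
    z = matvec(w_down, x)         # project down: d_model -> d_latent
    scores = matvec(w_router, x)  # assumption: router sees the full-width input
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Softmax over the selected experts only (assumed gating scheme).
    peak = max(scores[i] for i in chosen)
    gates = [math.exp(scores[i] - peak) for i in chosen]
    norm = sum(gates)

    mixed = [0.0] * len(z)
    for gate, i in zip(gates, chosen):
        w_in, w_out = experts[i]
        hidden = [max(0.0, h) for h in matvec(w_in, z)]  # placeholder ReLU expert
        out = matvec(w_out, hidden)                      # expert output stays in latent space
        mixed = [m + (gate / norm) * o for m, o in zip(mixed, out)]

    return matvec(w_up, mixed)    # project back up: d_latent -> d_model


# Toy sizes (hypothetical); the post's real widths are 4096/1024 and 8192/2048.
D_MODEL, D_LATENT, D_HIDDEN = 16, 4, 8
NUM_EXPERTS, TOP_K = 4, 2

rng = random.Random(0)
w_down = rand_matrix(D_LATENT, D_MODEL, rng)
w_router = rand_matrix(NUM_EXPERTS, D_MODEL, rng)
experts = [
    (rand_matrix(D_HIDDEN, D_LATENT, rng), rand_matrix(D_LATENT, D_HIDDEN, rng))
    for _ in range(NUM_EXPERTS)
]
w_up = rand_matrix(D_MODEL, D_LATENT, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
y = latent_moe_forward(x, w_down, w_router, experts, w_up, TOP_K)
print(len(x), "->", D_LATENT, "->", len(y))
```

The point of the design is visible in the shapes: every expert matrix has a side of length `D_LATENT` instead of `D_MODEL`, so the per-expert weights shrink with the latent width, while the two shared projections (`w_down`, `w_up`) are paid only once per layer.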

This is a nice example of architecture scaling by combining several efficiency mechanisms: Mamba-2, grouped-query attention (GQA), and Latent MoE.

As always, more details and higher-res figures in my LLM Architecture Gallery.

Jun 4

4:56 PM
