The LatentMoE architecture of Nemotron 3 is interesting and a great learning exercise for LLM / MoE inference patterns…
MoE basics. An MoE layer replaces the single feed-forward neural net in each transformer block / layer with N independent feed-forward networks (experts). For each token activation, a (trainable) routing mechanism sparsely selects K experts to process that token; e.g., a linear layer that takes in a token activation and outputs a vector of N scores, from which the top K experts are chosen. When the experts are sharded across GPUs, routing requires all-to-all communication so that we can send tokens to the GPUs that store the experts to which they are routed. Then, we do another all-to-all communication to collect tokens after expert computation.
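As a minimal sketch of the routing step described above, here is top-K selection for a single token in plain Python. The function name `route_token`, the toy sizes, and the choice to softmax over only the selected scores are assumptions for illustration; real models differ in how they normalize gate weights.

```python
import math
import random

def route_token(x, router_weights, k):
    """Return the top-k expert indices and normalized gate weights for one token.

    x: list of d floats (token activation).
    router_weights: N rows of d floats, one row per expert (a hypothetical linear router).
    """
    # Linear router: one score (logit) per expert.
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in router_weights]
    # Sparse selection: keep only the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Softmax over the selected logits so the gate weights sum to 1 (assumed choice).
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    total = sum(exps)
    gates = [v / total for v in exps]
    return top, gates

random.seed(0)
d, n_experts, k = 8, 4, 2  # toy sizes, not Nemotron's
router_weights = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
x = [random.gauss(0, 1) for _ in range(d)]
experts, gates = route_token(x, router_weights, k)
```

The returned `experts` list is what determines where the token is sent during the first all-to-all; the `gates` weight the expert outputs when they are combined after the second one.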
What is LatentMoE? The key idea of LatentMoE is that, before expert routing, we down-project token activations from hidden dimension d to a smaller latent dimension l, do expert routing / computation in the latent / smaller space, then project back up afterward. This makes the routed part of the MoE cheaper in two ways:
1. token activations are smaller, so all-to-all communication is cheaper.
2. each expert has a smaller weight matrix, so loading expert weights is cheaper as well.
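The two savings above can be put into rough numbers. The sketch below assumes d = 4096 and l = 1024 (the values quoted later in this post), 2 bytes per value, and a plain two-matrix expert with a hypothetical intermediate size of 2048; the helper `routed_costs` is made up for illustration.

```python
def routed_costs(width, expert_inner_dim, bytes_per_value=2):
    """Per-token all-to-all payload and per-expert weight size at a given routed width."""
    # Each routed copy of a token carries `width` values over the network.
    token_bytes = width * bytes_per_value
    # Assumed expert shape: two matrices, (width x inner) and (inner x width).
    expert_weight_bytes = 2 * width * expert_inner_dim * bytes_per_value
    return token_bytes, expert_weight_bytes

d, latent = 4096, 1024
inner = 2048  # hypothetical expert intermediate size

full_token, full_expert = routed_costs(d, inner)
lat_token, lat_expert = routed_costs(latent, inner)

token_ratio = full_token / lat_token      # all-to-all payload shrinks by d / l
expert_ratio = full_expert / lat_expert   # expert weights shrink by d / l
print(token_ratio, expert_ratio)
```

Under these assumptions both ratios equal d / l = 4, and the intermediate size cancels out of the weight ratio.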
Side note: Nemotron also has shared expert(s) that operate at the full hidden dimension, in addition to the routed experts (which operate in the latent dimension). To understand why the savings above matter, we need to understand MoE bottlenecks a bit better.
Key MoE inference patterns. There are two high-level categories of inference patterns that are commonly seen when serving MoEs:
1. Latency-bound: process tens to hundreds of tokens at a time, prioritize low response time / fast answers (e.g., chat)
2. Throughput-bound: process thousands of tokens at a time, prioritize processing many requests in parallel and finishing the maximum number of requests / tokens per second (e.g., offline inference jobs, serving many chat interfaces from one model).
These types of inference have different bottlenecks. For latency-bound inference, our main bottleneck is memory bandwidth, meaning that we spend most of the time reading expert weights from memory rather than actually running matrix multiplies. This is because we aren’t handling many tokens, so there is little compute to amortize each time we load an expert's weights from memory.
On the other hand, throughput-bound inference is communication-bound. The biggest bottleneck is the all-to-all communication that happens in the MoE layer. Because we are processing many tokens, each weight load is amortized over a lot of compute, but every one of those tokens must also be sent across GPUs after routing (and collected again afterward), so communication volume grows with the batch.
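A back-of-the-envelope way to see the two regimes above is to count how many FLOPs each expert performs per byte of weights it reads. The accelerator numbers below are hypothetical round figures, routing is assumed perfectly balanced, and only weight traffic is counted (activations and communication are ignored), so treat this as a sketch rather than a performance model.

```python
def tokens_per_expert(batch_tokens, num_experts, active_experts):
    """Average tokens each expert sees, assuming perfectly balanced routing."""
    return batch_tokens * active_experts / num_experts

def flops_per_weight_byte(tokens, bytes_per_weight=2):
    """Arithmetic intensity of one expert matmul, counting only weight traffic.

    A (tokens x a) @ (a x b) matmul does 2*tokens*a*b FLOPs and reads
    a*b*bytes_per_weight bytes of weights, so a and b cancel.
    """
    return 2 * tokens / bytes_per_weight

# Hypothetical accelerator, round numbers for illustration only.
PEAK_FLOPS = 1000e12    # FLOP/s
MEM_BANDWIDTH = 3e12    # bytes/s
break_even = PEAK_FLOPS / MEM_BANDWIDTH  # FLOPs per byte needed to keep compute busy

def weight_bound(batch_tokens, num_experts=128, active_experts=6):
    """True if reading expert weights, not compute, limits the expert matmuls."""
    t = tokens_per_expert(batch_tokens, num_experts, active_experts)
    return flops_per_weight_byte(t) < break_even

small = weight_bound(256)    # 12 tokens per expert on average
large = weight_bound(8192)   # 384 tokens per expert on average
```

With these assumed numbers, a 256-token batch leaves each expert far below the break-even intensity (weight reads dominate), while an 8192-token batch clears it, which is when the all-to-all traffic becomes the cost left to worry about.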
How LatentMoE helps is straightforward. By operating in the latent dimension, we reduce both the cost of all-to-all communication and the cost of loading expert weights from memory. So, this change actually benefits both inference patterns.
Scaling up. Instead of just banking the cost savings, however, Nemotron reinvests them into scaling up the MoE, increasing both the number of experts and the number of active experts per token by roughly a factor of d / l. The smaller latent routing space makes each expert cheaper, and that budget is spent on a larger model. For example, Nemotron 3 tests the following setting (where d / l = 4, the expert count grows 4x, and the active expert count grows from 6 to 22, or about 3.7x) to get the metrics in the table shown in the figure below:
"The Standard MoE model uses a hidden dimension of size 𝑑 = 4096 and 128 total experts with 6 active experts, while LatentMoE uses a latent dimension ℓ = 1024 and 512 total experts with 22 active experts."
As shown in the table, this approach works pretty well and we see consistent performance benefits across multiple benchmarks. We basically get a better model with a similar computational / communication budget, which aligns with prior work showing that many finer-grained (smaller) experts tend to improve MoE performance compared to fewer large experts.
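The "similar computational / communication budget" claim can be sanity-checked from the quoted configuration. The sketch below compares only the routed experts, assumes the expert intermediate size is the same in both models (so it cancels), and ignores the router, the shared expert(s), and the down / up projections; `routed_budget` is a hypothetical helper.

```python
def routed_budget(width, num_experts, active_experts):
    """Relative costs of the routed experts, in units where constant factors cancel."""
    return {
        # Values sent per token in one all-to-all dispatch (one copy per active expert).
        "all_to_all_per_token": active_experts * width,
        # Expert weights touched per token scale with active experts times width
        # (the expert intermediate size is assumed equal in both models).
        "active_weights_per_token": active_experts * width,
        # Total routed parameters scale with the number of experts times width.
        "total_expert_weights": num_experts * width,
    }

standard = routed_budget(width=4096, num_experts=128, active_experts=6)
latent = routed_budget(width=1024, num_experts=512, active_experts=22)

# Ratio of LatentMoE to Standard MoE for each cost.
ratios = {key: latent[key] / standard[key] for key in standard}
```

Under these assumptions the total routed weights match exactly (ratio 1.0), while per-token communication and active weights come out at 22 / 24 of the standard model, i.e. slightly under budget.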