Youssef Hosni (@youssefhosni95)

Make money doing the work you believe in

Half the expert compute in Qwen3-30B-A3B and GLM-4.7-Flash can be cut with marginal accuracy loss and ~20% faster inference.

A Mixture-of-Experts (MoE) model is a transformer where most of the parameters live in "experts": small feed-forward networks inside each layer. Instead of running every token through every expert, a router picks the top-K most relevant experts per token and runs the token only through those. That's how Qwen3-30B-A3B has 30B total parameters but only activates ~3B per token.
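To make top-K routing concrete, here is a minimal pure-Python sketch. The toy experts, the logits, and the choice to renormalize the selected weights so they sum to 1 are all assumptions for illustration, not details taken from Qwen3 or the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # assumption: renormalize over the chosen k
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)  # only k experts are ever called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Hypothetical toy experts: each one just scales the token by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]

routes = top_k_route(logits, k=2)
print([i for i, _ in routes])  # [1, 3]
output = moe_layer([1.0, 1.0], experts, logits, k=2)
```

Note that `k` is a constant here: every token pays for exactly two expert calls, no matter what the router scores look like.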

The catch: K is fixed. Every token gets the same expert budget, whether it's a filler word like "the" or a tricky derivation step. A token that needed one expert and a token that needed eight both get K. Dynamic MoE tries to vary K per token, but most prior methods require pre-training from scratch — useless for the open-weight MoE models people actually deploy.

That's where ZEDA comes in. New paper from Tsinghua + Shanghai AI Lab: Zero-Expert Self-Distillation Adaptation. It adds a strange option to the router — "experts" that output exactly zero. When the router picks one of these for a token, the token skips an expensive transformation entirely. K stays nominally fixed, but the effective compute spent per token can drop by half.
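A rough sketch of the zero-expert idea, under stated assumptions: the router gets extra slots past the real experts, and any selected slot in that range contributes nothing and costs nothing. The slot layout, the function names, and the use of raw softmax weights without renormalization are hypothetical; the paper's actual weighting may differ.

```python
import math

def route_with_zero_experts(router_logits, num_real, k):
    """Pick the top-k slots; indices >= num_real are zero experts (hypothetical layout)."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    real = [(i, probs[i]) for i in chosen if i < num_real]
    skipped = k - len(real)  # slots that went to zero experts
    return real, skipped

def zero_expert_layer(token, experts, router_logits, k):
    """Run only the real experts that were selected; zero experts add nothing."""
    real, skipped = route_with_zero_experts(router_logits, len(experts), k)
    out = [0.0] * len(token)
    for idx, weight in real:
        out = [o + weight * e for o, e in zip(out, experts[idx](token))]
    return out, skipped

# Four toy real experts plus two zero-expert slots (router indices 4 and 5).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# One real expert and one zero expert win the top-2.
mixed_out, mixed_skipped = zero_expert_layer([1.0, 1.0], experts, [0.1, 2.0, -1.0, 1.5, 3.0, 0.0], k=2)

# Both zero experts win: the layer does no expert work for this token.
easy_out, easy_skipped = zero_expert_layer([1.0, 1.0], experts, [0.0, 0.0, 0.0, 0.0, 5.0, 4.0], k=2)
print(mixed_skipped, easy_skipped)  # 1 2
```

K is still 2 in both calls, but the number of real expert calls is 1 and 0. That is the sense in which K stays nominally fixed while effective compute varies per token.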

The clever part isn't the dummy expert. It's the adaptation. They freeze the original MoE and use it as the teacher. A new version of the same model — with zero-output experts wired into every MoE layer — learns from the teacher in two stages.

First, SFT on teacher rollouts to stabilize the architecture change. Then on-policy distillation, where the student generates its own rollouts and the teacher provides token-level targets. The student learns when it is safe to skip work without drifting from the teacher's predictions.
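One way to picture the token-level signal in the second stage is a per-token divergence between the student's and the teacher's next-token distributions, computed on text the student generated itself. The sketch below uses KL(student || teacher), which is an assumption: the post does not say which divergence the paper uses, and the distributions here are made-up toy values.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions given as lists of probabilities.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_dists, teacher_dists):
    """Mean per-token KL(student || teacher) over one student-generated rollout."""
    per_token = [kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token), per_token

# Toy rollout of two positions over a three-token vocabulary.
student = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]]
teacher = [[0.7, 0.2, 0.1], [0.9, 0.1, 0.0]]

loss, per_token = distillation_loss(student, teacher)
# Position 0 matches the teacher, so it contributes almost nothing;
# position 1 disagrees, so it carries nearly all of the loss.
```

In a real setup the loss would be backpropagated through the student, including its router, so skipping is penalized only at positions where it moves the student away from the teacher.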

Results across 11 benchmarks (math, code, instruction following):

~51-53% of expert activations replaced with zero experts
Marginal average accuracy loss vs the original
1.20x end-to-end inference speedup
Beats prior dynamic MoE baselines by 6.1 and 4.0 points
Adaptation cost: <31 hours on 8x H200 for Qwen, <62 for GLM
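Why does cutting about half the expert activations give 1.20x rather than 2x? A back-of-envelope check with Amdahl's law, not taken from the paper: it assumes K = 8 routed experts per token for Qwen3-30B-A3B (from the model's published config), that expert time scales linearly with real expert calls, and that everything else (attention, routing, sampling) is unchanged.

```python
K = 8                  # assumption: routed experts per token in Qwen3-30B-A3B
zero_fraction = 0.52   # midpoint of the reported ~51-53%
speedup = 1.20         # reported end-to-end speedup

# Average number of real experts still executed per token per layer.
effective_experts = K * (1 - zero_fraction)

# Amdahl's law: speedup = 1 / (1 - f * zero_fraction), solved for f,
# the share of end-to-end time originally spent in expert compute.
expert_time_share = (1 - 1 / speedup) / zero_fraction

print(round(effective_experts, 2))   # 3.84
print(round(expert_time_share, 2))   # 0.32
```

Under these assumptions, expert compute would account for roughly a third of end-to-end latency, which is consistent with halving it yielding about 20%.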

The finding I keep coming back to: expert usage didn't track task difficulty. The model didn't spend more compute on hard tasks and less on easy ones. It spent more compute where the student disagreed with the teacher, or where its own next-token uncertainty was high. Structured code and math fragments — usually the "hard" content in benchmark folklore — needed fewer experts, not more.

This isn't pruning. It's the router learning where the model is unsure.

May 26

2:28 PM
