Recently read a great overview of multi-teacher on-policy distillation (MOPD) and how it was used in recent LLMs like MiMo-V2-Flash, GLM-5, Nemotron-Cascade 2, and DeepSeek-V4…
What is on-policy distillation (OPD)? The idea of OPD is simple. We have a student and a teacher. We sample trajectories from the student, then use reverse KL divergence as the objective, so that the student's token distribution matches the teacher's along these trajectories. This training setup can be integrated into the GRPO loss by replacing the group-relative advantage with the negative per-token reverse KL.
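To make the advantage swap concrete, here is a minimal sketch in plain Python. It assumes we already have the log-probability that the student and the teacher each assign to every token the student sampled. The function names are hypothetical and not taken from any of the reports.

```python
def reverse_kl_advantages(student_logprobs, teacher_logprobs):
    """Per-token advantage: teacher log-prob minus student log-prob.

    This is the negative of the single-sample reverse KL estimate
    log pi_student(token) - log pi_teacher(token), evaluated on tokens
    that the student itself sampled.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("student and teacher must score the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


def surrogate_loss(student_logprobs, advantages):
    """Policy-gradient style surrogate: -mean(advantage * log pi_student).

    In an autograd framework the advantages would be treated as constants
    (no gradient flows through them); here we only compute the value.
    """
    terms = [a * s for a, s in zip(advantages, student_logprobs)]
    return -sum(terms) / len(terms)


# Toy trajectory with two sampled tokens (numbers are made up).
student = [-0.5, -2.0]
teacher = [-0.7, -1.0]
adv = reverse_kl_advantages(student, teacher)
loss = surrogate_loss(student, adv)
```

A token the teacher finds more likely than the student does gets a positive advantage, so the update raises its probability. A token the teacher finds less likely gets a negative one.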
Multi-teacher OPD (MOPD) extends this idea by using more than one teacher during OPD training. This is useful because the gains from RLVR training tend to be domain-specific. If we train a model on math with RLVR, this may improve math performance but harm model quality on creative tasks. Similarly, RL training on tool-use data could degrade performance on general-purpose benchmarks. To solve this see-saw problem, we can train domain-specific models with RL and use MOPD to distill them into a single student.
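A minimal sketch of the multi-teacher step, assuming the teacher is picked by the domain of the prompt (the selection rule the MiMo-V2-Flash and GLM-5 descriptions below mention). The teacher interface, the domain labels, and the fallback rule are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical teacher interface: maps the student's sampled tokens to the
# log-probability the teacher assigns to each of them.
TeacherFn = Callable[[List[int]], List[float]]


def mopd_advantages(
    domain: str,
    tokens: List[int],
    student_logprobs: List[float],
    teachers: Dict[str, TeacherFn],
    fallback: str = "general",
) -> List[float]:
    """Route the trajectory to one teacher by domain, then score it.

    Returns the per-token advantage (teacher log-prob minus student
    log-prob) under the selected teacher.
    """
    teacher = teachers.get(domain, teachers[fallback])
    teacher_logprobs = teacher(tokens)
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# Toy teachers that return constant log-probs (stand-ins for real models).
toy_teachers: Dict[str, TeacherFn] = {
    "math": lambda toks: [-0.1] * len(toks),
    "general": lambda toks: [-1.0] * len(toks),
}

math_adv = mopd_advantages("math", [11, 42], [-0.5, -0.5], toy_teachers)
other_adv = mopd_advantages("poetry", [7, 8], [-0.5, -0.5], toy_teachers)
```

The student still generates every trajectory itself. Only the source of the target log-probabilities changes from prompt to prompt.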
Post-training with MOPD has become a common choice for recent models:
MiMo-V2-Flash: starts from a general SFT model and uses domain-specific models from across the post-training pipeline (i.e., SFT models, RL specialists, and the student itself) as teachers for MOPD at the final stage of post-training, where teachers are selected heuristically by domain.
GLM-5: starts from the final RL checkpoint produced by a sequential RL pipeline covering reasoning, agentic, and general domains, and uses the final checkpoint of each stage as a teacher, where the teacher is again selected by the domain of the prompt. MOPD here aims to recover capabilities lost in earlier stages rather than to merge capabilities across domains.
Nemotron-Cascade 2: places MOPD at the mid-point of post-training as a stabilization step between stages. Three checkpoints from earlier training stages (math, RLHF, and multi-domain RL) are chosen as teachers.
DeepSeek-V4: trains a large number (10+) of domain experts independently using domain-specific SFT and RL, then distills all of them into a single student. Interestingly, this paper uses full-vocabulary distillation, which has very high memory overhead and is infrastructurally complex, instead of approximating the KL from the sampled token alone.
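The difference between the two KL computations can be sketched for a single token position as follows. This is a toy illustration in plain Python with made-up logits over a four-token vocabulary. The function names are hypothetical, and it is not a description of any report's actual implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def full_vocab_reverse_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher): a sum over every vocabulary entry."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))


def sampled_token_reverse_kl(student_logits, teacher_logits, token):
    """Single-sample estimate: log-ratio at the one token the student drew."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return s[token] - t[token]


# Made-up logits for one position.
student_logits = [2.0, 1.0, 0.1, -1.0]
teacher_logits = [1.5, 1.5, 0.0, -2.0]

exact = full_vocab_reverse_kl(student_logits, teacher_logits)

# Averaging the single-sample estimate under the student's own distribution
# recovers the exact value, so the estimate is unbiased but noisier.
probs = [math.exp(lp) for lp in log_softmax(student_logits)]
expected_estimate = sum(
    p * sampled_token_reverse_kl(student_logits, teacher_logits, i)
    for i, p in enumerate(probs)
)
```

The full-vocabulary version needs a vocabulary-sized vector from each teacher at every position, while the sampled version needs only one scalar per position per teacher. That gap is where the memory overhead comes from.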
The blog also includes a great snippet about why self-distillation is a useful addition in MOPD: “Self is a snapshot of the student at the start of MOPD—a fixed, stable reference distribution. On tokens where the SFT/RL teachers push the student into unfamiliar territory, distilling toward Self prevents catastrophic drift.”
Although MOPD now appears in several different reports, the approaches are all quite similar (i.e., reverse KL, on-policy sampling from the student, multiple teachers), indicating that OPD / MOPD is becoming a standardized component of training pipelines for recent models.