On-policy distillation is on track to be a lasting method in post-training. The list of method classes would then be:
Instruction tuning (SFT/IFT)
RLHF
Direct Preference Optimization (DPO et al.)
RLVR
On-policy Distillation (OPD)
New classes of methods are rare! Excited to play.