On-policy distillation is on track to be a lasting method in post-training. The list of method classes would then be:

Instruction tuning (SFT/IFT)

RLHF

Direct Preference Optimization (DPO et al.)

RLVR

On-policy Distillation (OPD)

New classes of methods are rare! Excited to play.

May 18 at 11:02 PM