Nathan Lambert (@natolambert): "Added a 1500 word mini history to my book on the path to on-policy distillation being a core post-training optimization technique. The math is fairly simple, seems like the sort of thing that started working as our distributed systems for training got better. It's very remarkab…"


Added a 1500 word mini history to my book on the path to on-policy distillation being a core post-training optimization technique.

The math is fairly simple; it seems like the sort of thing that started working as our distributed systems for training got better. It's very remarkable to me that a blog post from Kevin Lu at Thinking Machines is the canonical reference for using the reverse KL divergence as an advantage within policy-gradient tools. This switch to distillation objectives within RL setups enables a lot of fun reward shaping ideas.
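A minimal sketch of what "reverse KL as an advantage" can look like, assuming the per-token advantage is the negative of a single-sample reverse KL estimate on tokens the student itself sampled. The function names and the toy log-probabilities here are hypothetical, chosen for illustration, and not taken from any particular codebase:

```python
import math

def reverse_kl_advantages(student_logprobs, teacher_logprobs):
    """Per-token advantage A_t = -(log pi_student(y_t) - log pi_teacher(y_t)).

    Both inputs are log-probabilities of the SAME tokens y_t, which were
    sampled from the student (on-policy). Averaged over student samples,
    log pi_student - log pi_teacher estimates KL(student || teacher).
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

def policy_gradient_loss(student_logprobs, advantages):
    """REINFORCE-style surrogate: -mean(A_t * log pi_student(y_t)).

    In an autodiff framework the advantages would be treated as constants
    (no gradient flows through them); here everything is a plain float.
    """
    n = len(student_logprobs)
    return -sum(a * lp for a, lp in zip(advantages, student_logprobs)) / n

# Toy rollout of three student-sampled tokens (made-up probabilities).
student = [math.log(0.5), math.log(0.9), math.log(0.2)]
teacher = [math.log(0.5), math.log(0.3), math.log(0.6)]

adv = reverse_kl_advantages(student, teacher)
loss = policy_gradient_loss(student, adv)
```

Under this assumption, a token the teacher finds less likely than the student does gets a negative advantage, a token the teacher prefers gets a positive one, and a token where they agree gets zero. Because the signal arrives as an ordinary advantage, it slots into existing policy-gradient tooling and can be mixed with other reward terms.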

This also means that on-policy distillation was obviously helped in its proliferation by the mass engineering effort in getting RL algorithms right over the last few years.

Lastly, as someone already very familiar with Rishabh Agarwal's early work on generalized knowledge distillation and its connection to imitation learning algorithms like DAgger, I recommend reading the concurrent work MiniLLM, which was technically the first to propose using a policy-gradient-like, on-policy rollout approach for distillation.

The switch from learning from teacher demonstrations to learning from student rollouts seems so obvious in hindsight, given where we are with RL hype, but at the time it obviously took a bunch of work to get right.

Excited to figure out how to make post-training recipes around this!

rlhfbook.com

Synthetic Data & CAI | RLHF Book by Nathan Lambert

The Reinforcement Learning from Human Feedback Book

May 6

3:14 PM
