I'm giving a (short) talk on the origins, motivations, and key ideas of rubrics & RL tomorrow.
Rubric-based RL is a nice area of research that emerged very naturally from several concurrent lines of work:
- RLVR (RL with verifiable rewards) works really well.
- We want something like RLVR for non-verifiable domains.
- Usually RLHF is used here, but reward models are hackable, and training them requires lots of preference data.
- LLM judges work really well (especially with recent reward models).
- We can make LLM judges more reliable by providing granular, prompt-specific scoring criteria (i.e., rubrics).
- Concurrent research has already explored using rubrics for safety alignment.
- Why not just directly use rubrics to derive a reward for RL in arbitrary domains? You can, and it works well (rough sketch after this list).
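For that last point, here's a minimal sketch of what "rubric as reward" can look like: score a response against prompt-specific criteria and aggregate into a scalar reward. The `judge_fn`, `Criterion`, and toy rubric below are hypothetical illustrations, not from any specific paper; in practice `judge_fn` would query an LLM judge.

```python
# Minimal sketch of rubric-based reward: aggregate per-criterion
# LLM-judge scores into one scalar reward for RL.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    description: str  # prompt-specific scoring criterion
    weight: float     # relative importance within the rubric


def rubric_reward(
    prompt: str,
    response: str,
    rubric: list[Criterion],
    judge_fn: Callable[[str, str, str], float],  # returns a score in [0, 1]
) -> float:
    """Weighted average of per-criterion judge scores -> scalar reward."""
    total = sum(c.weight for c in rubric)
    return sum(
        c.weight * judge_fn(prompt, response, c.description) for c in rubric
    ) / total


# Toy usage: a trivial keyword check standing in for a real LLM-judge call.
rubric = [
    Criterion("Cites at least one source", weight=1.0),
    Criterion("Directly answers the question", weight=2.0),
]
toy_judge = lambda p, r, crit: 1.0 if "source" in r.lower() else 0.0
print(rubric_reward("Why is the sky blue?",
                    "Rayleigh scattering (source: ...)", rubric, toy_judge))
```

The details vary across papers (some use binary per-criterion checks, others holistic scores), but the shape is the same: rubric in, scalar reward out.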
Results with rubric-based RL vary by domain but are getting more impressive as LLM judges become more capable. While most of the talk focuses on explaining where rubric-based RL came from, I also provide links to my favorite papers in the space.
Slides are here: docs.google.com/present…