
I'm giving a (short) talk on the origins, motivations, and key ideas of rubrics & RL tomorrow.

Rubric-based RL is a research area that emerged very naturally from several concurrent lines of work:

- RLVR works really well.

- We want RLVR for non-verifiable domains.

- RLHF is the usual fallback in non-verifiable domains, but reward models are hackable and training them requires lots of preference data.

- LLM judges work really well (especially with recent reward models).

- We can make LLM judges more reliable by providing granular, prompt-specific scoring criteria (i.e., rubrics).

- Concurrent research has already explored using rubrics for safety alignment.

- Why not just directly use rubrics to derive a reward for RL in arbitrary domains? You can, and it works well.
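The core idea in the bullets above can be sketched in a few lines: score a response against each criterion of a prompt-specific rubric and aggregate into a scalar reward. Everything here is illustrative — the `Criterion` format, the weights, and especially the keyword-matching `judge` stub, which stands in for what would really be a call to a strong LLM judge:

```python
# Rough sketch of rubric-based reward derivation. The rubric schema,
# weights, and stub judge below are illustrative assumptions, not any
# particular paper's implementation.

from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # prompt-specific scoring criterion shown to the judge
    weight: float     # relative importance of this criterion

def judge(response: str, criterion: Criterion) -> float:
    """Stand-in for an LLM judge; returns a score in [0, 1].
    A real system would prompt an LLM with the criterion text
    and the response, then parse its verdict."""
    keyword = criterion.description.lower().split()[-1].strip(".")
    return 1.0 if keyword in response.lower() else 0.0

def rubric_reward(response: str, rubric: list[Criterion]) -> float:
    """Weighted average of per-criterion judge scores -> scalar RL reward."""
    total = sum(c.weight for c in rubric)
    return sum(c.weight * judge(response, c) for c in rubric) / total

rubric = [
    Criterion("Mentions limitations", 1.0),
    Criterion("Cites evidence", 2.0),
]
print(rubric_reward("The claim is backed by cited evidence, despite limitations.", rubric))
```

The resulting scalar plugs into any RLVR-style training loop in place of a programmatic verifier, which is why the jump from verifiable rewards to rubric rewards is so natural.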

Results with rubric-based RL vary by domain but are getting more impressive over time as we get more capable LLM judges. While most of the talk focuses on explaining where rubric-based RL came from, I also provide links to my favorite papers in the space.

Slides are here: docs.google.com/present…

Feb 26 at 3:37 AM