Cameron R. Wolfe, Ph.D. (@cwolferesearch)

I've been reading a lot about rubrics-as-rewards (RaR) for RL. Some of my favorite papers (so far):

arxiv.org/abs/2507.17746
arxiv.org/abs/2508.12790
arxiv.org/abs/2510.07743
arxiv.org/abs/2511.19399
arxiv.org/abs/2507.18624

Most of the added technical complexity of RaR has less to do with RL and more to do with reward modeling. If we can get a reliable reward signal, RaR works well, but teaching a model to perform granular, instance-level evaluation is tough. Generalizing these evaluation capabilities across arbitrary domains is even tougher, especially in highly subjective ones. Our reward model also needs to hold up against reward hacking in large-scale RL runs.
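
To make "granular / instance-level evaluation" concrete, here is a minimal sketch of how a rubric could be collapsed into a scalar reward. This is an illustration, not the method of any of the papers above: the `Criterion` class, the `rubric_reward` function, the weighted-fraction aggregation, and the toy `keyword_judge` are all hypothetical. In practice the judge would be a (generative) reward model, which is exactly the hard part.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item written for a specific prompt (hypothetical structure)."""
    description: str
    weight: float  # assumed positive; larger means more important


# A judge decides whether a response satisfies one criterion.
# Assumption: in a real setup this would call a generative reward model.
Judge = Callable[[str, str, str], bool]


def rubric_reward(prompt: str, response: str,
                  rubric: List[Criterion], judge: Judge) -> float:
    """Return the weighted fraction of satisfied criteria, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in rubric
                 if judge(prompt, response, c.description))
    return earned / total


def keyword_judge(prompt: str, response: str, criterion: str) -> bool:
    """Toy stand-in for a reward model: checks for a literal phrase."""
    return criterion.lower() in response.lower()


prompt = "What should I know before taking ibuprofen?"
rubric = [
    Criterion("dosage", 2.0),
    Criterion("side effects", 1.0),
    Criterion("see a doctor", 1.0),
]
response = "Follow the dosage on the label. Side effects can include nausea."

reward = rubric_reward(prompt, response, rubric, keyword_judge)
print(reward)  # 0.75: "dosage" and "side effects" are met, "see a doctor" is not
```

A literal keyword check like this is trivially hackable (a policy could just emit the phrases), which is a small-scale version of the reward hacking concern above.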

In my opinion, new developments in this space are likely to come from advancing the frontier of (generative) reward models rather than RL. So much to be done.

Feb 4, 4:49 AM
