New paper! Bringing ideas from meta RL into the LM RL domain to help solve the hardest problems with sequential attempts.

It's a self-reflection approach, but it can be generalized. When using RL on very hard problems, LMs should learn from context, not just make more attempts from scratch (i.e., standard GRPO). Led by Teng Xiao.

Mar 16 at 5:38 PM