New paper! Bringing ideas from meta-RL into RL for LMs to help solve the hardest problems over sequential attempts.
It's a self-reflection approach, but it can be generalized. LMs should learn from context when using RL on very hard problems, not just make more attempts from scratch (i.e., standard GRPO). Led by Teng Xiao.