Does anyone get the sense that RLVR (reinforcement learning with verifiable rewards) is not enough to train a model to be as good as an expert human at tasks like coding?
One observation I've made about domain experts is that they are most valuable when writing code that is hard to verify in general (e.g. code exercising various race conditions, or where the space of edge cases is very large). Yet experts seem to have an intuition for whether a solution is sufficient without actually going through the exercise of verifying every possible path.
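To make that concrete, here's a minimal sketch of the kind of bug I mean: a check-then-act race that a deterministic test suite will essentially always pass, even though the code is wrong under concurrent interleavings. The `Wallet`/`withdraw` names are purely illustrative, not from any real codebase.

```python
import threading

class Wallet:
    def __init__(self, balance: int):
        self.balance = balance

    def withdraw(self, amount: int) -> bool:
        # BUG: check-then-act without a lock. Two threads can both pass
        # the check and drive the balance negative.
        if self.balance >= amount:
            self.balance -= amount
            return True
        return False

def naive_test():
    # The kind of test a verifier typically runs: single-threaded,
    # deterministic. It passes reliably even though the code is broken.
    w = Wallet(100)
    assert w.withdraw(60) is True
    assert w.withdraw(60) is False

def stress_test() -> int:
    # Exposing the race requires concurrent trials, and even then the
    # failure is probabilistic, not guaranteed on any single run.
    w = Wallet(100)
    threads = [threading.Thread(target=w.withdraw, args=(60,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w.balance  # can be -20 under an unlucky interleaving

naive_test()
print(stress_test())
```

An expert reads `withdraw` and flags the unlocked check-then-act immediately; a verifier has to get lucky with thread scheduling to ever see the failure.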
For this reason you cannot get expert-human-level performance out of a weak model by throwing more inference compute at it, precisely because verification is so expensive. And I don't think RLVR, as I have seen it implemented, is sufficient to post-train a model to learn these heuristics either.
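For what it's worth, here's roughly what I mean by "the way I have seen RLVR be implemented": the reward is a binary pass/fail from a verifier, so any behavior the test suite doesn't exercise contributes nothing to the training signal. This is a hedged sketch with hypothetical helpers (`rlvr_reward`, `test_suite`), not any particular lab's setup.

```python
from typing import Callable, Iterable

def rlvr_reward(candidate: str,
                test_suite: Iterable[Callable[[str], bool]]) -> float:
    """Binary verifiable reward: 1.0 iff the candidate passes every test.

    Anything the suite does not cover (rare interleavings, a huge
    edge-case space) contributes nothing to the gradient, so the policy
    is never pushed toward the expert's "is this sufficient?" judgment.
    """
    return 1.0 if all(test(candidate) for test in test_suite) else 0.0
```

The race condition above is exactly the case where this reward goes blind: the buggy `withdraw` earns the same 1.0 as a correct one.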