In this post, I’ll briefly explore DeepSeek’s latest paper, Inference-Time Scaling for Generalist Reward Modeling, published ahead of the rumored release of DeepSeek-R2.
The paper is fascinating: it introduces a new training method for reward models, the component in reinforcement learning (RL) that scores LLM answers and helps guide them toward better performance. The trained model generates its own scoring criteria for each query and then critiques how well each response follows those criteria. Crucially, the researchers also scale inference-time compute: by sampling many independent sets of criteria and critiques and aggregating the resulting scores, they obtain more accurate and reliable rewards.
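To make that mechanism concrete, here's a minimal Python sketch of the idea as I understand it. This is my own illustration, not DeepSeek's code: the `llm` callable, the judging-prompt wording, and the score-extraction regex are all assumptions, and averaging is just one of several possible aggregation schemes.

```python
import re
from statistics import mean

def generative_reward(query: str, responses: list[str], llm, num_samples: int = 8) -> list[float]:
    """Score candidate responses with a generative reward model, scaling
    quality with inference-time compute via parallel sampling.

    `llm` is a hypothetical stand-in for any text-generation callable,
    e.g. llm(prompt: str, temperature: float) -> str.
    """
    all_scores: list[list[float]] = [[] for _ in responses]
    for _ in range(num_samples):
        # 1. Ask the reward model to write its own evaluation criteria
        #    ("principles") for this specific query, then critique each
        #    response against them and emit a numeric score per response.
        judging_prompt = (
            "First, write the principles a good answer to the query below "
            "should satisfy. Then critique each candidate response against "
            "those principles and end each critique with 'Score: <1-10>'.\n\n"
            f"Query: {query}\n\n"
            + "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(responses))
        )
        critique = llm(judging_prompt, temperature=1.0)  # diversity across samples matters
        scores = [float(s) for s in re.findall(r"Score:\s*(\d+)", critique)]
        if len(scores) == len(responses):  # keep only well-formed judgments
            for i, s in enumerate(scores):
                all_scores[i].append(s)
    # 2. Aggregate across samples: combining many independently generated
    #    criteria/critique sets is what turns extra inference-time compute
    #    into a more reliable reward signal.
    return [mean(s) if s else 0.0 for s in all_scores]
```

The key design point is in step 2: a single critique is noisy, but because each sample generates its own criteria before judging, the samples are decorrelated, and spending more compute (a larger `num_samples`) steadily sharpens the reward estimate.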