A Chinese whistleblower from Meta’s AI team on Llama 4:
After repeated training runs, the internal model's performance still fails to reach open-source SOTA levels, and in fact lags far behind them. Company leadership suggested mixing various benchmark test sets into the post-training process, aiming to produce a result that "looks okay" across…
In this post, I’ll briefly explore DeepSeek’s latest paper, Inference-Time Scaling for Generalist Reward Modeling, published ahead of the rumored release of DeepSeek-R2.
The paper is fascinating: it introduces a new training method for reward models, a key component in reinforcement learning (RL) that scores an LLM's answers and helps guide the model…
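To make the reward model's role concrete, here is a minimal PyTorch sketch. It is not DeepSeek's architecture: a toy GRU encoder stands in for an LLM backbone, and all names and sizes are illustrative. The essential shape is the same, though: a tokenized prompt-plus-answer pair goes in, a single scalar score comes out, and an RL loop would use that score as its training signal.

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Illustrative scalar reward model: encoder + linear reward head."""

    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # Toy stand-in for an LLM backbone; a real reward model would
        # typically reuse a pretrained transformer here.
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.reward_head = nn.Linear(hidden, 1)  # one scalar per sequence

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) tokenized prompt+answer pairs
        x = self.embed(token_ids)
        _, h = self.encoder(x)                     # h: (1, batch, hidden)
        return self.reward_head(h[-1]).squeeze(-1)  # (batch,) scores

model = TinyRewardModel()
answers = torch.randint(0, 1000, (2, 16))  # two fake tokenized answers
print(model(answers))                       # one reward score per answer
```

In practice such a model is trained on human preference data (pairs of answers where one is ranked above the other), and its scores then replace the human judge inside the RL loop.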