Interested in learning how to run RL at scale? Here are the best resources to read…
Research on Scaling RL
The Art of Scaling RL Compute for LLMs: arxiv.org/abs/2510.13786
Scaling Behaviors of LLM RL Post-Training: arxiv.org/abs/2509.25300
Optimally Scaling Sampling Compute for LLM RL: arxiv.org/abs/2603.12151
Scaling Up RL: arxiv.org/abs/2507.12507
ProRL V2 - Prolonged Training Validates RL Scaling Laws: hijkzzz.notion.site/pro…
Polaris - A Recipe for Scaling RL with Reasoning Models: hkunlp.github.io/blog/2…
RL Frameworks
HybridFlow (outline of the verl framework): arxiv.org/abs/2409.19256
More up-to-date info can be found here: arxiv.org/abs/2601.18150
AReaL - Large-Scale Async RL: arxiv.org/abs/2505.24298
PipelineRL - Fast On-Policy RL: arxiv.org/abs/2509.19128
AsyncFlow - Async Streaming RL: arxiv.org/abs/2507.01663
RL for Agents
DeepSWE - Open Coding Agent Trained w/ RL: together.ai/blog/deepswe
AutoForge - Environment Synthesis for Agentic RL: arxiv.org/abs/2512.22857
Agent-R1 - Training Agents w/ End-to-End RL: arxiv.org/abs/2511.14460
AgentRL - Scaling RL for Multi-Turn, Multi-Task Agents: arxiv.org/abs/2510.04206
The Landscape of Agentic RL: arxiv.org/abs/2509.02547
Training SWE Agents with RL: arxiv.org/abs/2508.03501
Case Studies & Tech Reports
Kimi tech reports:
Kimi K2 - Open Agentic Intelligence: arxiv.org/abs/2507.20534
Kimi End-to-end Agentic RL: moonshotai.github.io/Ki…
Kimi K1.5 - Scaling RL for LLMs: arxiv.org/abs/2501.12599
Composer series from Cursor:
Composer 2: arxiv.org/abs/2603.24477
Composer 2.5: cursor.com/blog/compose…
Olmo 3 (also has open code / data): arxiv.org/abs/2512.13961
MiniMax tech reports:
MiniMax-M2: arxiv.org/abs/2605.26494
MiniMax-M1: arxiv.org/abs/2506.13585
Nemotron 3 (NVIDIA): arxiv.org/abs/2512.20856