RLHF Book status update: lots of great changes.
Over the past month I've been doing a top-to-bottom update of the RLHF book. All of these changes are live on the website, rlhfbook.com, and will soon be carried over to the Manning early access version (MEAP), with more improvements to follow for the physical copy.
Overall, this took the PDF from ~150 to ~200 pages, and the book is much more well-rounded now.
Some of the larger changes:
Updates to the RL chapter to add more algorithms like GSPO, CISPO, etc.
Updated the big table of reasoning model tech reports (full list below). Added a section on Rubrics for RLVR.
Updated the text in many chapters to better reflect best practices of today.
Many clarity fixes throughout, adding better transitions, introductions, etc.
More consistent notation throughout the book.
I strongly recommend taking another look if you last read it in the first half of 2025. There are also many surprising details, such as fixes to the attached RLHF system diagram, which you may recognize from my first HuggingFace RLHF blog post in December of 2022 and which had a bunch of minor errors.
Next, I'm going to focus on making the physical Manning book great. The content will flow more smoothly than the web version (where I'm trying not to change the links), for example by linking the constitutional AI and synthetic data chapters. Overall, this should make it read better from front to back. Also, all the diagrams and content will be designed to have a much more elegant presentation.
Thanks for reading and for the feedback!
Reasoning model reports I recommend reading:
2025-01-22 - DeepSeek R1 - arxiv.org/abs/2501.12948
2025-01-22 - Kimi 1.5 - arxiv.org/abs/2501.12599
2025-03-31 - Open-Reasoner-Zero - arxiv.org/abs/2503.24290
2025-04-10 - Seed-Thinking 1.5 - arxiv.org/abs/2504.13914
2025-04-30 - Phi-4 Reasoning - arxiv.org/abs/2504.21318
2025-05-02 - Llama-Nemotron - arxiv.org/abs/2505.00949
2025-05-12 - INTELLECT-2 - arxiv.org/abs/2505.07291
2025-05-12 - Xiaomi MiMo - arxiv.org/abs/2505.07608
2025-05-14 - Qwen 3 - arxiv.org/abs/2505.09388
2025-05-21 - Hunyuan-TurboS - arxiv.org/abs/2505.15431
2025-05-28 - Skywork OR-1 - arxiv.org/abs/2505.22312
2025-06-04 - Xiaomi MiMo VL - arxiv.org/abs/2506.03569
2025-06-04 - OpenThoughts - arxiv.org/abs/2506.04178
2025-06-10 - Magistral - arxiv.org/abs/2506.10910
2025-06-16 - MiniMax-M1 - arxiv.org/abs/2506.13585
2025-07-10 - Kimi K2 - arxiv.org/abs/2507.20534
2025-07-28 - GLM-4.5 - arxiv.org/abs/2508.06471
2025-08-20 - Nemotron Nano 2 - arxiv.org/abs/2508.14444
2025-09-09 - K2-Think - arxiv.org/abs/2509.07604
2025-09-23 - LongCat-Flash-Thinking - arxiv.org/abs/2509.18883
2025-10-21 - Ring-1T - arxiv.org/abs/2510.18855
2025-11-20 - OLMo 3 Think - arxiv.org/abs/2512.13961
2025-12-02 - DeepSeek V3.2 - arxiv.org/abs/2512.02556
2025-12-05 - K2-V2 - arxiv.org/abs/2512.06201
2025-12-15 - Nemotron 3 Nano - arxiv.org/abs/2512.20848
2025-12-16 - MiMo-V2-Flash - raw.githubusercontent.c…