RLHF Book status update: lots of great changes.

Over the past month I've been doing a top-to-bottom update of the RLHF book. All of these changes are reflected on the website rlhfbook.com and will soon be carried over to the Manning early access version (MEAP), with more improvements to follow for the physical copy.

Overall, this took the PDF from ~150 to ~200 pages, and the book is much more well-rounded now.

Some of the larger changes:

  • Updates to the RL chapter to add more algorithms like GSPO, CISPO, etc.

  • Updated the big table of reasoning model tech reports (full list below). Added a section on Rubrics for RLVR.

  • Updated the text in many chapters to better reflect best practices of today.

  • Many clarity fixes throughout, adding better transitions, introductions, etc.

  • More consistent notation throughout the book.

I strongly recommend taking another look if you last looked in the first half of 2025. There are also many surprising details, such as fixes to the attached RLHF system diagram, which you may recognize from my first HuggingFace RLHF blog post in December of 2022; it had a bunch of minor errors.

As a next step, I'm going to focus on making the physical Manning book great. The content will flow more smoothly than in the web version (where I'm trying not to change the links), for example by linking the constitutional AI and synthetic data chapters. Overall this should make it read better from front to back. Also, all the diagrams and content will be designed to have a much more elegant presentation.

Thanks for reading and for the feedback!

Reasoning model reports I recommend reading:

2025-01-22 - DeepSeek R1 - arxiv.org/abs/2501.12948

2025-01-22 - Kimi 1.5 - arxiv.org/abs/2501.12599

2025-03-31 - Open-Reasoner-Zero - arxiv.org/abs/2503.24290

2025-04-10 - Seed-Thinking 1.5 - arxiv.org/abs/2504.13914

2025-04-30 - Phi-4 Reasoning - arxiv.org/abs/2504.21318

2025-05-02 - Llama-Nemotron - arxiv.org/abs/2505.00949

2025-05-12 - INTELLECT-2 - arxiv.org/abs/2505.07291

2025-05-12 - Xiaomi MiMo - arxiv.org/abs/2505.07608

2025-05-14 - Qwen 3 - arxiv.org/abs/2505.09388

2025-05-21 - Hunyuan-TurboS - arxiv.org/abs/2505.15431

2025-05-28 - Skywork OR-1 - arxiv.org/abs/2505.22312

2025-06-04 - Xiaomi MiMo VL - arxiv.org/abs/2506.03569

2025-06-04 - OpenThoughts - arxiv.org/abs/2506.04178

2025-06-10 - Magistral - arxiv.org/abs/2506.10910

2025-06-16 - MiniMax-M1 - arxiv.org/abs/2506.13585

2025-07-10 - Kimi K2 - arxiv.org/abs/2507.20534

2025-07-28 - GLM-4.5 - arxiv.org/abs/2508.06471

2025-08-20 - Nemotron Nano 2 - arxiv.org/abs/2508.14444

2025-09-09 - K2-Think - arxiv.org/abs/2509.07604

2025-09-23 - LongCat-Flash-Thinking - arxiv.org/abs/2509.18883

2025-10-21 - Ring-1T - arxiv.org/abs/2510.18855

2025-11-20 - OLMo 3 Think - arxiv.org/abs/2512.13961

2025-12-02 - DeepSeek V3.2 - arxiv.org/abs/2512.02556

2025-12-05 - K2-V2 - arxiv.org/abs/2512.06201

2025-12-15 - Nemotron 3 Nano - arxiv.org/abs/2512.20848

2025-12-16 - MiMo-V2-Flash - raw.githubusercontent.c…

Jan 2 at 4:34 PM