I have been pretty heads-down this year finishing Chapter 6 on implementing reinforcement learning with verifiable rewards from scratch (using GRPO). I just wrapped it up this weekend, and I'd say it's the best (or at least my favorite) chapter yet!
The goal of this chapter is to explain and implement GRPO from the bottom up. This means coding and walking through each GRPO step one by one (advantages, rewards, logprobs, and loss) and then training a 0.6B base model on the 12k examples from the MATH training set.
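For a rough idea of how those steps fit together, here is a minimal plain-Python sketch. It is not the chapter's code: the function names, the exact-match reward, the example answers, and the log-probability values are all made up for illustration, and it leaves out details such as per-token log-probabilities, clipping, and any KL term.

```python
import math

def verifiable_reward(predicted: str, reference: str) -> float:
    """Hypothetical reward: 1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when all rollouts get the same reward
    return [(r - mean) / (std + eps) for r in rewards]

def simple_grpo_loss(seq_logprobs, advantages):
    """Simplified loss: negative mean of advantage-weighted sequence log-probabilities."""
    weighted = [a * lp for a, lp in zip(advantages, seq_logprobs)]
    return -sum(weighted) / len(weighted)

# Four made-up rollouts for one prompt whose reference answer is "42"
answers = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(a, "42") for a in answers]

# Rollouts that beat the group average get positive advantages, the rest negative
advantages = group_advantages(rewards)

# Made-up summed token log-probabilities, one value per rollout
seq_logprobs = [-3.2, -2.9, -4.1, -3.5]
loss = simple_grpo_loss(seq_logprobs, advantages)

print("rewards:   ", rewards)
print("advantages:", [round(a, 3) for a in advantages])
print("loss:      ", round(loss, 4))
```

In an actual training loop the log-probabilities would come from the model being trained, so minimizing this loss pushes up the probability of rollouts with above-average reward and pushes down the others.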
(This takes the model from 15% to 47% accuracy on the MATH-500 test set, which is about as good as the official Qwen3 reasoning model of similar size.)
The focus is on readability and understanding GRPO, but the supplementary materials also contain scripts to run it in a multi-GPU setting.
The code notebook is already available on GitHub if you want to take a look: github.com/rasbt/reason….
(And the full chapter should make it to the early access version of the book at mng.bz/Nwr7 soon!)
PS: The next chapter will introduce additional tips and tricks to improve the GRPO algorithm for better and more stable training behavior.