Finished Ch07 on Improving GRPO for Reinforcement Learning!
Building on the GRPO from scratch intro, this chapter adds (and analyzes) more bells and whistles! (Clipped policy ratios, a KL term, format rewards, and a couple of other improvements.)
github.com/rasbt/reason…