Nathan Lambert (@natolambert)

I remember how, ~2.5 years ago with Lewis T, Ed B, and co. at HuggingFace, it took months to get DPO working right.

Today, coding agents can build an entire repository from scratch, referencing high-quality implementations and discussing trade-offs, and run a representative training job on your desk. This was a 1B model on thousands of samples.

It really changes accessibility to AI research and tinkering, along with what it means to work in AI.

I just merged the PR for this which adds a bunch of direct alignment algorithms (DPO etc) to the rlhfbook code repo, and it's remarkable how much easier this is today.

I'm feeling even more confident about what the book is becoming -- a dense place for intuitions about what actually works with models, free of hallucinations and hype. Students can use this as a reference beside code and experiments that the AI models can spin up in an afternoon.

At its best, the RLHF Book will become a central place for people to discuss, iterate, and build community around this learning material.

Nathan Lambert

Feb 1

Claude Code with Opus 4.5 driving, OpenAI's Codex for code review, and GPT Pro for planning made a working DPO (and related algorithms) repository from scratch for my RLHF book, and the curves are looking right. It ran on the DGX Spark, finetuning OLMo 2 1B SFT. Built by referencing the original repositories + TRL.

We're living in the future.

github.com/natolambert/rlhf-book/pull/226

Feb 2

3:40 PM
