An imitation-trained model cannot exceed its demonstrators.

This is not a hyperparameter problem. It is a structural one. By construction, supervised fine-tuning (SFT) pushes the model toward responses that resemble its demonstrations. If the best available human reasoners produce work of a certain average quality, an SFT-only model trained on that work asymptotes near the same quality.
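
A toy sketch can make the ceiling concrete. It is only an illustration under loud assumptions: the three answer labels, their quality scores, and the demonstrator's answer distribution are all made up, and "training" is reduced to a maximum-likelihood fit of a categorical distribution, which is what pure imitation amounts to in this simplified setting.

```python
import random
from collections import Counter


def fit_by_imitation(demonstrations):
    """Maximum-likelihood categorical fit: just the empirical answer frequencies."""
    counts = Counter(demonstrations)
    total = len(demonstrations)
    return {answer: n / total for answer, n in counts.items()}


def expected_quality(policy, quality):
    """Average quality of answers sampled from a policy."""
    return sum(p * quality[answer] for answer, p in policy.items())


# Hypothetical quality score for each possible answer (assumed for illustration).
quality = {"weak": 0.2, "good": 0.6, "best": 0.9}
# Hypothetical demonstrator: mostly "good", only sometimes "best".
demonstrator = {"weak": 0.2, "good": 0.6, "best": 0.2}

rng = random.Random(0)
answers = list(demonstrator)
weights = [demonstrator[a] for a in answers]
demonstrations = rng.choices(answers, weights=weights, k=10_000)

imitator = fit_by_imitation(demonstrations)
demo_quality = expected_quality(demonstrator, quality)
imitator_quality = expected_quality(imitator, quality)

print(f"demonstrator average quality: {demo_quality:.3f}")
print(f"imitator average quality:     {imitator_quality:.3f}")
print(f"best achievable quality:      {max(quality.values()):.3f}")
```

In this toy, more demonstrations only move the imitator closer to the demonstrator's average. Nothing in the objective rewards choosing "best" more often than the demonstrators did.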

To produce a model that reasons better than the best humans available to label, the training signal needs to come from somewhere other than human demonstration.

This is what motivates the rest of post-training. Not aesthetic preference for one paradigm over another. A capability ceiling that imitation cannot break.

Every paradigm that comes after imitation is an answer to this problem.

The Age of Post-Training, Part 1: Learning by Imitation
May 8 at 9:22 PM