
A lot of people are talking about this paper as a simple replication of test-time compute (TTC) scaling: arxiv.org/abs/2501.19393

I think it’s worth reading, but I finished it a little confused:

  • they do SFT on Gemini Thinking reasoning traces. isn’t that model already a TTC-scaled model? if the point is to replicate TTC scaling, then distilling from a teacher model that already has that capability seems to miss the point?

  • they use “budget forcing” to control output length: interrupt and force an answer for shorter outputs; suppress the EOS token and append “wait” for longer outputs (a rough sketch of this control loop is after this list). this is how they show that more tokens = better outputs, i.e., TTC scaling. but wouldn’t you expect this same behavior from a base model? i.e. if you interrupt a CoT at step 1, step 2, step 3, etc., you’d get increasingly better results? this seems like a no-brainer, but as far as i can tell they didn’t test budget forcing on the baseline model. very strange.

  • they even say they expect the reasoning ability to already be present, and that their finetuning just activates it. but they don’t actually test that claim?

    “We hypothesize that the model is already exposed to large amounts of reasoning data during pretraining which spans trillions of tokens. Thus, the ability to perform reasoning is already present in our model. Our sample-efficient finetuning stage just activates it and we scale it further at test time with budget forcing.”

  • to be clear: obviously the TTC scaling from reasoning models is a different beast than simple CoT (see llmpromptu.substack.com… and llmpromptu.substack.com…). but there is already some very limited TTC scaling in plain CoT, and this paper doesn’t seem to clearly distinguish the two.
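for readers who haven’t opened the paper, here is roughly what that budget-forcing loop looks like. this is a minimal sketch, not the authors’ code: the generate helper, the <think>/</think> delimiters, the “Wait” string, and the word-level token counting are all assumptions for illustration.

```python
# Minimal sketch of the budget-forcing idea, assuming a hypothetical
# generate(prompt, max_new_tokens) helper that returns the model's text
# continuation. Delimiters, the "Wait" string, and word-level token counting
# are illustrative assumptions, not the authors' implementation.

END_THINK = "</think>"  # assumed end-of-thinking delimiter


def budget_forced_answer(generate, question, min_tokens, max_tokens):
    prompt = question + "\n<think>\n"
    trace = generate(prompt, max_new_tokens=max_tokens)

    # Lengthen: if the model ended its reasoning before the minimum budget,
    # strip the end-of-thinking delimiter, append "Wait", and let it continue
    # (i.e. suppress termination and spend more test-time compute).
    while trace.endswith(END_THINK) and len(trace.split()) < min_tokens:
        trace = trace[: -len(END_THINK)] + "Wait\n"
        trace += generate(prompt + trace, max_new_tokens=max_tokens)

    # Shorten: if the budget is exhausted mid-reasoning, cut the trace off and
    # force an answer by appending the delimiter and an answer cue.
    if not trace.endswith(END_THINK):
        trace += "\n" + END_THINK
    return generate(prompt + trace + "\nFinal Answer:", max_new_tokens=64)


if __name__ == "__main__":
    # Toy stand-in for a real decoder, just to exercise the control flow.
    def fake_generate(prompt, max_new_tokens):
        if prompt.rstrip().endswith("Final Answer:"):
            return "42"
        return "step 1 ... step 2 ... step 3 ... " + END_THINK

    print(budget_forced_answer(fake_generate, "What is 6 * 7?", min_tokens=4, max_tokens=200))
```

the point of the sketch is just that there are two separate knobs: a ceiling (cut the trace and force an answer) and a floor (suppress termination and append “Wait”); the natural control would be applying the same knobs to the base model’s plain CoT, which is the comparison i couldn’t find.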

