
How do you train small reasoning models more effectively?

This is a problem many AI devs run into: RL fine-tuning tends to plateau, especially for models in the 1–2B parameter range.

I think DeepSearch offers a really clean approach here. It takes the idea of Monte Carlo Tree Search (MCTS) at inference and moves it into the training loop. That shift unlocks better exploration and more efficient learning.

Here are my notes from the paper:

The loop involves four key ideas:

Searching During Training: Instead of only doing search at test-time, MCTS is run during RL training. A local UCT selector ranks siblings, while a global frontier scorer picks promising leaves across the whole tree based on parent value, entropy, and depth.
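To make the two-level selection concrete, here is a minimal sketch. The UCT formula is the standard one; the frontier score is my own illustrative linear combination of parent value, entropy, and depth — the weights and functional form are assumptions, not the paper's exact scorer.

```python
import math

def uct_score(q, n_child, n_parent, c=1.4):
    """Standard UCT used to rank sibling nodes locally:
    exploitation (mean value q) plus a visit-count exploration bonus."""
    if n_child == 0:
        return float("inf")  # unvisited children are expanded first
    return q + c * math.sqrt(math.log(n_parent) / n_child)

def frontier_score(parent_value, entropy, depth,
                   w_v=1.0, w_e=0.5, w_d=0.1):
    """Hypothetical global frontier score: favors leaves under high-value
    parents with uncertain (high-entropy) policies, penalizing depth.
    The linear form and weights are illustrative assumptions."""
    return w_v * parent_value + w_e * entropy - w_d * depth
```

The point of the split is that UCT only compares siblings, while the frontier score lets the search jump to the most promising leaf anywhere in the tree.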

Learning From Both Wins and Confident Wrongs: If a correct solution isn’t found, the model still learns by supervising the confident wrong path (lowest entropy mistakes). Correct paths stay non-negative during updates, which helps with step-level credit assignment.
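A sketch of the negative-supervision selection, assuming each rollout carries a correctness flag and a mean token entropy (the dict schema here is illustrative, not the paper's data structure):

```python
def pick_negative_path(paths):
    """Among failed rollouts, pick the most *confident* mistake,
    i.e. the incorrect path with the lowest mean token entropy,
    to use as negative supervision."""
    wrong = [p for p in paths if not p["correct"]]
    if not wrong:
        return None  # a correct path exists; no negative target needed
    return min(wrong, key=lambda p: p["entropy"])
```

The intuition: a low-entropy wrong answer is a systematic error the model is sure about, so penalizing it yields a stronger learning signal than penalizing a random guess.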

Stabilizing RL with Tree-GRPO: They refine PPO-style objectives with node-level q-values, mean-only normalization, and a soft clipping strategy. This avoids reward explosions while keeping gradients informative.
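A rough sketch of how I read "mean-only normalization" and "soft clipping"; the tanh squashing below is my own interpretation of a smooth alternative to PPO's hard clamp, not the paper's exact objective:

```python
import math

def tree_grpo_advantage(q_values):
    """Mean-only normalization: subtract the group mean but do NOT
    divide by the std, so the gradient magnitude stays informative."""
    mu = sum(q_values) / len(q_values)
    return [q - mu for q in q_values]

def soft_clip(ratio, eps=0.2):
    """Hypothetical soft clipping: instead of clamping the importance
    ratio hard to [1-eps, 1+eps], squash the excess with tanh so
    gradients outside the trust region shrink smoothly to zero."""
    if ratio > 1 + eps:
        return 1 + eps + eps * math.tanh((ratio - 1 - eps) / eps)
    if ratio < 1 - eps:
        return 1 - eps - eps * math.tanh((1 - eps - ratio) / eps)
    return ratio
```

Skipping the std division matters when a node's siblings have nearly identical q-values: dividing by a tiny std would blow up the advantages, which is exactly the reward-explosion failure mode mentioned above.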

Staying Efficient: To cut wasted compute, DeepSearch filters to a hard subset of problems, caches solutions once they’re verified, and skips full search when an answer is already known.
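The caching idea can be sketched in a few lines; `search_fn` and `verify_fn` are placeholder names for the expensive tree search and the answer verifier, not functions from the paper:

```python
def run_with_cache(problem_id, cache, search_fn, verify_fn):
    """Skip the full MCTS rollout when a verified solution for this
    problem is already cached from an earlier training step."""
    if problem_id in cache:
        return cache[problem_id]          # reuse the verified solution
    solution = search_fn(problem_id)      # expensive tree search
    if solution is not None and verify_fn(solution):
        cache[problem_id] = solution      # cache only verified answers
    return solution
```

Combined with filtering the dataset down to the hard subset, this keeps the compute budget focused on problems the model can't yet solve.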

All of these improvements lead to strong results.

DeepSearch-1.5B reaches 62.95% on AIME/AMC benchmarks, beating a top Nemotron baseline while using only ~330 GPU hours. By comparison, normal RL training plateaus lower even with 1,800+ GPU hours.

Paper: arxiv.org/abs/2509.25454

I think this paper offers a practical recipe for breaking through plateaus in small reasoning LMs:

• Move search into training, not just inference

• Supervise both right and wrong paths

• Use global prioritization to explore smarter

• Cache and filter to keep efficiency high

Oct 2 at 4:01 PM