Sebastian Raschka, PhD (@rasbt): "According to benchmarks, a new small 3B-parameter model achieves Opus-4.5-/frontier-level coding performance. It comes with a very good technical report with lots to learn. What's fascinating is, the whole model builds on the old Qwen2.5-Coder-3B stack (yes, Qwen2.5, not Qwen3.…"

According to benchmarks, a new small 3B-parameter model achieves Opus-4.5-/frontier-level coding performance.

It comes with a very good technical report with lots to learn. What's fascinating is, the whole model builds on the old Qwen2.5-Coder-3B stack (yes, Qwen2.5, not Qwen3.5).

So, that's a pretty clear example of how much of the performance gain comes from good data curation and post-training pipelines.

Based on the tech report, here are some of the important pieces of their post-training stack:

1. High-signal synthetic data (math problems with credible solutions, code with tests)

2. Multiple reasoning paths for each answer

3. Filtering, filtering, filtering

4. 2-stage SFT (start with broad training, then train on hard long-reasoning samples)

5. Use target (pass@k) accuracy over validation loss for checkpoint selection

6. MGPO (MaxEnt-Guided Policy Optimization) for RLVR: basically a GRPO-style RL method with an extra weighting that favors examples that are neither too easy nor too hard for the current policy

7. Single 64k long-context RL (they found that the usual progressive context expansion hurt this model because early truncation damaged long-thinking behavior)

8. Training data order: they do Math RL, then Code RL, then STEM RL, in this particular order, which they found helped overall

9. After optimizing for accuracy, they add a stage that rewards shorter correct trajectories; basically making the model more efficient without accuracy degradation

10. Offline self-distillation: they collect high-quality verified trajectories from the Math, Code, and STEM RL checkpoints, filter them, and distill them back into one unified student model

11. Instruct RL: the final stage uses rule-based validators and rubric-based reward models to improve instruction following (again, while preserving the reasoning gains)
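
To make items 5 and 6 a bit more concrete, here is a minimal Python sketch. `pass_at_k` is the standard unbiased pass@k estimator (1 - C(n-c, k) / C(n, k) for n samples with c passing), and `select_checkpoint` picks the checkpoint with the best mean pass@k instead of the lowest validation loss. `entropy_weight` is only a stand-in for the MGPO idea: it uses the binary entropy of a prompt's empirical pass rate, which peaks at 0.5 and drops to 0 for prompts that are always or never solved. That is my assumption for illustration, not the formula from the tech report, and all function names and the `results` layout are hypothetical.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def select_checkpoint(results: dict[str, list[tuple[int, int]]], k: int) -> str:
    """Return the checkpoint name with the highest mean pass@k.

    `results` maps a checkpoint name to a list of (n, c) pairs,
    one pair per benchmark problem (hypothetical layout).
    """
    def mean_pass_at_k(name: str) -> float:
        per_problem = results[name]
        return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)

    return max(results, key=mean_pass_at_k)


def entropy_weight(pass_rate: float) -> float:
    """Binary entropy (in bits) of a prompt's pass rate under the current policy.

    Illustrative stand-in for an MGPO-style weight, NOT the report's formula:
    1.0 at pass_rate == 0.5, 0.0 when the prompt is always or never solved.
    """
    p = pass_rate
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
```

For example, `pass_at_k(10, 3, 1)` comes out at about 0.3 (3 of 10 samples passed), and a prompt the policy solves half the time gets the maximum weight from `entropy_weight`, while one it solves 10% or 90% of the time gets less. In a GRPO-style loop, such a weight would scale each prompt's advantage or loss term.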

Besides, it's also really cool to see how far one can push a small 3B model, and that one can do impressive research and engineering work on a small(er) scale!

They don't share the exact GPU hours for this project, but if we were to go with their previous VibeThinker 1.5B model report (which had some numbers), I'd probably say it cost around $25k to $60k. Sure, that's a lot of money, but it's not millions!

(Caveat: the model is pretty new, and the benchmarks could be too good to be true; I need to use it over the next few days to see if the vibes of VibeCoder actually check out in practice. But an impressive first impression! And a nice post-training method write-up.)

Jun 17

1:13 PM
