NEW paper from Sakana AI (ICLR 2026).
A 7B Conductor model just hit SOTA on GPQA-Diamond and LiveCodeBench by orchestrating other LLMs instead of solving problems itself.
(great paper! bookmark it!)
The Conductor is trained with RL to do two things at once: design communication topologies between worker agents (open or closed source), and write focused, prompt-engineered instructions for each worker that play to that worker's individual strengths.
It's like training one specialist agent to handle both who collaborates with whom and what each collaborator is told.
Trained against randomized agent pools, it adapts to arbitrary mixes of agents at inference time. Even more interesting: when allowed to pick itself as a worker, it forms recursive topologies, unlocking a new form of dynamic test-time scaling through online iterative adaptation.
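To make the two jobs concrete, here's a toy sketch (not the paper's implementation, and every name in it is hypothetical): a `conductor` function that, given a problem and a pool of workers, emits (1) a communication topology, a DAG of who feeds whom, and (2) a focused instruction per worker. In the paper this policy is a learned 7B model; here it's hard-coded just to show the interface.

```python
from typing import Callable

Worker = Callable[[str], str]  # instruction -> answer

def conductor(problem: str, pool: dict[str, Worker]):
    """Return (topology, instructions). topology maps each worker to the
    list of workers whose outputs it receives as context."""
    names = list(pool)
    # Hard-coded plan for illustration: a simple chain where each
    # worker sees (and can refine) everything produced before it.
    topology = {n: names[:i] for i, n in enumerate(names)}
    instructions = {n: f"[{n}] Focus on your strength. Task: {problem}"
                    for n in names}
    return topology, instructions

def run(problem: str, pool: dict[str, Worker]) -> str:
    topology, instructions = conductor(problem, pool)
    outputs: dict[str, str] = {}
    # Execute in topological (here: insertion) order, passing each
    # worker its predecessors' outputs as extra context.
    for name, worker in pool.items():
        context = " | ".join(outputs[p] for p in topology[name])
        prompt = instructions[name] + (" CTX: " + context if context else "")
        outputs[name] = worker(prompt)
    return outputs[name]  # last worker's answer

# Toy usage with stub "agents":
pool = {
    "draft": lambda inst: "draft-answer",
    "check": lambda inst: "checked(" + inst.split("CTX: ")[-1] + ")",
}
print(run("2+2?", pool))  # -> checked(draft-answer)
```

The recursive variant from the paper would simply let `conductor` include itself in `pool`, so one plan can spawn sub-plans at inference time.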
The gains over the best individual worker on AIME25 and GPQA-D land in the ~3% range, which the authors note is comparable to an entire generational jump between frontier model versions, except this one comes from coordination, not pretraining.
Why it matters:
We can start to think of the orchestrator as the model now. Routing decisions aren't just a wrapper; they're a learnable policy.
Paper: arxiv.org/abs/2512.04388
Learn to build effective AI agents in our academy: academy.dair.ai