Cool research paper from Google.
This is what clever context engineering looks like.
It proposes Tool-Use Mixture (TUMIX), which leverages diverse tool-use strategies to improve reasoning.
This work shows how to get better reasoning from LLMs by running a bunch of diverse agents (text-only, code, search, etc.) in parallel and letting them share notes across a few rounds. Instead of brute-forcing more samples, it mixes strategies, stops when confident, and ends up both more accurate and cheaper.
Mix different agents, not just more of one: They ran 15 different agent styles (CoT, code execution, web search, guided variants). Each agent sees both the question and other agents’ past answers, then tries again. This back-and-forth makes the group smarter than any single agent.
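Here's a minimal sketch of what one of those refinement rounds looks like. Everything here is illustrative, not from the paper: `call_llm` is a stand-in for your model API, and the three agent styles stand in for the paper's 15.

```python
# Minimal sketch of one TUMIX-style refinement round.
# Hypothetical names throughout; three styles stand in for the paper's 15.

AGENT_STYLES = [
    "Answer with step-by-step text-only reasoning (CoT).",
    "Write Python code to compute the answer, then report the result.",
    "Propose web-search queries, summarize likely findings, then answer.",
]

def call_llm(prompt: str) -> str:
    """Stub: swap in a real client call (Gemini, OpenAI, etc.)."""
    raise NotImplementedError

def run_round(question: str, prior_answers: list[str]) -> list[str]:
    """Each agent sees the question plus every agent's previous answer."""
    shared_notes = "\n".join(f"- {a}" for a in prior_answers)
    answers = []
    for style in AGENT_STYLES:
        prompt = f"{style}\n\nQuestion: {question}\n"
        if prior_answers:
            prompt += f"\nOther agents' previous answers:\n{shared_notes}\n"
        prompt += "\nGive your best final answer."
        answers.append(call_llm(prompt))
    return answers
```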
Stop early, save cost: More rounds don’t always help. Too much refinement can kill diversity. They use an LLM judge to decide when to stop. That keeps accuracy high while cutting costs almost in half.
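A rough sketch of that stopping loop, reusing `call_llm` and `run_round` from above. The judge prompt and the minimum-rounds guard are my reading of the mechanism, not the paper's exact criterion:

```python
def judge_says_stop(question: str, answers: list[str]) -> bool:
    """Ask an LLM judge whether the committee has effectively converged."""
    verdict = call_llm(
        f"Question: {question}\n\nCandidate answers:\n"
        + "\n".join(f"- {a}" for a in answers)
        + "\n\nAre these answers consistent and confident enough to stop "
        "refining? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def tumix(question: str, max_rounds: int = 4, min_rounds: int = 2) -> list[str]:
    answers: list[str] = []
    for r in range(1, max_rounds + 1):
        answers = run_round(question, answers)
        # Don't let the judge stop too early: a couple of rounds of
        # note-sharing are needed before agreement means anything.
        if r >= min_rounds and judge_says_stop(question, answers):
            break
    return answers
```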
Better than existing methods: Compared with other tool-augmented scaling tricks, TUMIX consistently scores higher on tough reasoning benchmarks (HLE, GPQA-Diamond, AIME). For Gemini-2.5 Pro, it pushed HLE to 34.1%, which is a notable gain.
Diversity is the secret sauce: Combining text, code, and search agents beats repeatedly sampling the best single agent. More diverse tool use = more chances to land on the right reasoning path.
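After the last round you still have one answer per agent, so you need a selection step to turn the diverse committee into a single prediction. A simple majority vote is the obvious way to do that (the normalization and tie-breaking here are my own choices, not the paper's):

```python
from collections import Counter

def pick_answer(final_answers: list[str]) -> str:
    """Majority vote over the agents' final-round answers.
    Compares normalized strings, returns the original text."""
    normalized = [a.strip().lower() for a in final_answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return next(a for a in final_answers if a.strip().lower() == winner)
```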
Auto-agent design: They even had the LLM generate new agent types and mixed those in, which boosted results further. The sweet spot was around 12–15 different agent styles in the mix.
arxiv.org/abs/2510.01279
Track trending AI papers here: nlp.elvissaravia.com