I spent 12 hours testing 9 LLMs for building AI agents:
You can save up to 83% on costs.
Reasoning models are not the best.
Autonomy breaks fast. The real moat is orchestration.
Here's everything you need to know:
The task assigned to agents:
Create a new list inside Kanban (1 Trello board available)
Search the web for recent news about Amazon
Add all the search results to the Kanban list
Completing this task required:
Preparing a simple plan
Using multiple tools in the right order
Adjusting the plan if anything goes wrong
Operating autonomously (the system prompt gave no clues about the process)
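To make those four requirements concrete, here is a minimal sketch of a plan-and-retry loop. Everything in it is an assumption for illustration: create_list, search_news and add_card are in-memory stand-ins (not the Trello or web search APIs), the first-attempt failure is simulated, and the fixed three-step plan stands in for what the LLM would decide on its own.

```python
from typing import Callable

# Hypothetical in-memory stand-ins for the real tools (not the Trello or search APIs).
board: dict[str, list[str]] = {}
attempts = {"create_list": 0}

def create_list(name: str) -> str:
    attempts["create_list"] += 1
    if attempts["create_list"] == 1:
        # Simulate a failure on the first attempt.
        raise RuntimeError("board id missing")
    board[name] = []
    return name

def search_news(query: str) -> list[str]:
    # Fake search results; a real tool would call a search API.
    return [f"{query} headline {i}" for i in range(1, 4)]

def add_card(list_name: str, title: str) -> None:
    board[list_name].append(title)

def call_with_retry(tool: Callable, *args, max_attempts: int = 2):
    """Run a tool, retrying on failure. This is where a plan gets adjusted."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool(*args)
        except RuntimeError as err:
            last_error = err  # a real agent would feed this back to the model
    raise RuntimeError(f"{tool.__name__} failed: {last_error}")

def run_task(query: str, list_name: str) -> dict[str, list[str]]:
    # The plan: three tools, used in the right order.
    name = call_with_retry(create_list, list_name)
    results = call_with_retry(search_news, query)
    for title in results:
        call_with_retry(add_card, name, title)
    return board
```

Calling run_task("Amazon", "Amazon news") creates the list on the second attempt and then adds one card per search result. A model that never retries stops at the first error, and one that retries without checking state can end up with duplicate lists.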
Aggregated results (multiple runs):
🟢 GPT 4.1: New list created on the 2nd attempt
🟢 GPT 4.1-mini: New list created on the 2nd attempt
🟡 GPT o3-mini: New list created on the 2nd attempt, but doesn't create all cards
🟢 Grok 4: New list created on the 2nd attempt
🔴 Gemini 2.5 Flash: Fails 90%+ of the time, doesn't try to adjust the plan
🟡 Gemini 2.5 Pro: New list created on the 2nd attempt, but duplicates lists
🔴 Kimi K2: Fails 90%+ of the time, doesn't try to adjust the plan
🟡 Claude Sonnet 4: New list created on the 2nd attempt, but doesn't create all cards
🔴 DeepSeek R1: Fails 90%+ of the time, doesn't try to adjust the plan
Of course, it was just a single task.
But the results align well with what I've been observing in recent months.
Best practices:
Use non-reasoning LLMs for planning steps by default
Provide a scratchpad for the AI agent to take notes
Or use a solution like Sequential Thinking MCP to control the process
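A scratchpad can be as simple as a list of notes that gets fed back into the next prompt. The sketch below is one possible shape, not a reference implementation: the Scratchpad class and build_prompt helper are hypothetical names, and no real LLM or MCP client is involved.

```python
from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    """Hypothetical note store the agent can write to between tool calls."""
    notes: list[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)

    def render(self) -> str:
        # Numbered notes, ready to be appended to the next prompt.
        return "\n".join(f"{i}. {n}" for i, n in enumerate(self.notes, start=1))

def build_prompt(task: str, pad: Scratchpad) -> str:
    """Combine the task with the notes written so far."""
    return f"Task: {task}\n\nNotes so far:\n{pad.render() or '(none)'}"
```

For example, after pad.write("Created list 'Amazon news'") the next prompt carries that fact along, so the model doesn't have to re-derive it or create the list again.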
Observations:
GPT 4.1 > GPT 4.1-mini > Claude Sonnet 4 > other models
GPT 4.1-mini is 83% cheaper than GPT 4.1 but performs similarly in many cases
Beyond 2-3 tool calls, every model starts making mistakes
If the process requires more than 5 tools and dynamic planning, a single AI agent will most likely fail, no matter the model. You need to clearly separate the roles and move the logic to the orchestration layer.
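Here is a minimal sketch of what "moving the logic to the orchestration layer" can look like, under loud assumptions: the three roles (researcher, list_creator, card_writer) are hypothetical plain functions standing in for narrowly scoped LLM calls, and the state is a plain dict. The point is only that the step order and the checks live in code rather than in a prompt.

```python
from typing import Callable

# Each role does one job. In a real system each would wrap its own LLM call
# and at most a couple of tools; here they are hypothetical plain functions.
def researcher(state: dict) -> dict:
    state["results"] = [f"news item {i}" for i in range(1, 4)]
    return state

def list_creator(state: dict) -> dict:
    state["list"] = {"name": state["list_name"], "cards": []}
    return state

def card_writer(state: dict) -> dict:
    state["list"]["cards"].extend(state["results"])
    return state

# The orchestration layer: each entry is (role, key that role must produce).
PIPELINE: list[tuple[Callable[[dict], dict], str]] = [
    (researcher, "results"),
    (list_creator, "list"),
    (card_writer, "list"),
]

def orchestrate(list_name: str) -> dict:
    state: dict = {"list_name": list_name}
    for role, required_key in PIPELINE:
        state = role(state)
        # Fail loudly if a role did not produce what the next one needs.
        if not state.get(required_key):
            raise RuntimeError(f"{role.__name__} produced no '{required_key}'")
    return state
```

Because no single role sees more than one or two tools, each stays under the point where mistakes start to pile up, and a failure is pinned to a specific step.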
Bottom line:
We can create great agentic workflows and multi-agent systems. They follow well-defined processes powered by LLMs. One example is my multi-agent research system (a clone of Anthropic's): productcompass.pm/p/mul…
Possible: LLM-powered process automation
Doesn't exist yet: Fully "agentic AI"
Essential: Orchestration
An extra video (one of my tests) is in the comments.
Hope that helps.
—
P.S. Find this helpful?
You might enjoy 14 Principles of Building AI Agents: productcompass.pm/p/bui…