Paweł Huryn (@huryn)

Jul 25, 2025

The Product Compass

I spent 12 hours testing 9 LLMs for building AI agents:

You might easily save up to 83% on costs.
Reasoning models are not the best.
Autonomy breaks fast. A real moat is orchestration.

Here's everything you need to know:

‎

The task assigned to agents:

Create a new list inside Kanban (1 Trello board available)
Search the web to find recent news about Amazon
Add all the search results to the Kanban list

‎

Completing this task required:

Preparing a simple plan
Using multiple tools in the right order
Adjusting the plan if anything goes wrong
Operating autonomously (the system prompt gave no clues about the process)

‎

Aggregated results (multiple runs):

🟢 GPT 4.1: A new list created on 2nd attempt

🟢 GPT 4.1-mini: A new list on 2nd attempt

🟡 GPT o3-mini: A new list on 2nd attempt, doesn't create all cards

🟢 Grok 4: A new list on 2nd attempt

🔴 Gemini 2.5 Flash: Fails 90%+ of the time, doesn't try to adjust the plan

🟡 Gemini 2.5 Pro: A new list on 2nd attempt, duplicates lists

🔴 Kimi K2: Fails 90%+ of the time, doesn't try to adjust the plan

🟡 Claude Sonnet 4: A new list on 2nd attempt, doesn't create all cards

🔴 DeepSeek R1: Fails 90%+ of the time, doesn't try to adjust the plan

‎

Of course it was just a single task.

But the results align well with what I've been observing in recent months.

‎

Best practices:

Use non-reasoning LLMs for planning steps by default
Provide a scratchpad for the AI agent to make notes
Or use a solution like Sequential Thinking MCP to control the process
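
To make the scratchpad idea concrete, here is a minimal sketch in Python. It assumes the agent framework lets you expose plain functions as tools; the `Scratchpad` class and its `add_note` / `read_notes` methods are hypothetical names for illustration, not part of any specific library or of Sequential Thinking MCP.

```python
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    """In-memory notes the agent can write and read back between tool calls.

    Hypothetical helper for illustration; a real setup would register
    add_note and read_notes as tools in whatever agent framework you use.
    """
    notes: list[str] = field(default_factory=list)

    def add_note(self, text: str) -> str:
        # Return a short confirmation so the model sees the write succeeded.
        self.notes.append(text)
        return f"Saved note #{len(self.notes)}"

    def read_notes(self) -> str:
        # Numbered output keeps the plan easy for the model to re-read.
        if not self.notes:
            return "(no notes yet)"
        return "\n".join(f"{i}. {n}" for i, n in enumerate(self.notes, start=1))


pad = Scratchpad()
pad.add_note("Plan: create list, search news, add cards")
pad.add_note("Create-list call failed once; retry before moving on")
print(pad.read_notes())
```

The point of the design is that the plan lives outside the model's context juggling: the agent can re-read its own notes before each tool call instead of re-deriving the plan.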

‎

Observations:

GPT 4.1 > GPT 4.1-mini > Claude Sonnet 4 > Other models
GPT 4.1-mini is 83% cheaper than GPT 4.1 but performs similarly in many cases
Above 2-3 tool calls, every model makes mistakes
If the process requires more than 5 tools and dynamic planning, a single AI agent will most likely fail. No matter the model. You need to clearly separate the roles and move the logic to the orchestration layer.
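
As a rough illustration of moving logic into the orchestration layer, the sketch below hard-codes the order of the three steps from the test task and handles retries in plain Python instead of leaving them to the model. Everything here is an assumption for illustration: `run_pipeline`, `with_retry`, and the `fake_*` tools are hypothetical stand-ins, not a real Trello or search API. In a real system each step could be a narrowly scoped LLM call or agent.

```python
from typing import Callable


def with_retry(step: Callable[[], object], attempts: int = 2):
    """Run one step, retrying on failure. The orchestrator owns this logic."""
    last_error = None
    for _ in range(attempts):
        try:
            return step()
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"step failed after {attempts} attempts") from last_error


def run_pipeline(create_list, search_news, add_card, list_name: str, query: str) -> dict:
    """Fixed step order: create the list, search, then add one card per result."""
    list_id = with_retry(lambda: create_list(list_name))
    results = with_retry(lambda: search_news(query))
    # Bind r as a default argument so each retry uses the right result.
    created = [with_retry(lambda r=r: add_card(list_id, r)) for r in results]
    return {"list_id": list_id, "found": len(results), "cards": len(created)}


# Hypothetical stub tools. The first create_list call fails on purpose,
# mimicking the "new list on 2nd attempt" behavior seen in the tests.
calls = {"create_list": 0}
cards = []


def fake_create_list(name: str) -> str:
    calls["create_list"] += 1
    if calls["create_list"] == 1:
        raise RuntimeError("simulated tool error")
    return "list-1"


def fake_search_news(query: str) -> list[str]:
    return [f"{query} headline {i}" for i in range(1, 4)]


def fake_add_card(list_id: str, title: str) -> int:
    cards.append((list_id, title))
    return len(cards)


summary = run_pipeline(fake_create_list, fake_search_news, fake_add_card,
                       "Amazon news", "Amazon")
print(summary)
```

Because the step order, the retries, and the "one card per result" loop sit in ordinary code, failure modes like duplicated lists or missing cards become things the orchestrator can check for, not things the model has to remember.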

‎

Bottom line:

We can create great agentic workflows and multi-agent systems. They follow well-defined processes powered by LLMs. Like my multi-agent research system (a clone of Anthropic's): productcompass.pm/p/mul…

‎‎

Possible: LLM-powered process automation

Doesn't exist yet: Fully "agentic AI"

Essential: Orchestration

‎‎

An extra video (one of my tests) in the comments.

Hope that helps.

‎

—

‎

P.S. Find this helpful?

You might enjoy 14 Principles of Building AI Agents: productcompass.pm/p/bui…

Jul 25

3:37 AM
