I spent 12 hours testing 9 LLMs for building AI agents:

  • You might easily save up to 83% on costs.

  • Reasoning models are not the best.

  • Autonomy breaks fast. The real moat is orchestration.

Here's everything you need to know:

The task assigned to agents:

  1. Create a new list inside Kanban (1 Trello board available)

  2. Search the web to find recent news about Amazon

  3. Add all the search results to the Kanban list

Completing this task required:

  • Preparing a simple plan

  • Using multiple tools in the right order

  • Adjusting the plan if anything goes wrong

  • Operating autonomously (the system prompt gave no clues about the process)
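The task and the requirements above can be pictured as a tiny plan runner. This is only a sketch, not the harness used in these tests: `FakeBoard`, `create_list`, `add_card` and `search_web` are hypothetical stand-ins for the Trello and web-search tools, and "adjusting the plan" is reduced to a single retry.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBoard:
    """In-memory stand-in for a Kanban board (hypothetical, not the Trello API)."""
    lists: dict = field(default_factory=dict)
    create_attempts: int = 0

    def create_list(self, name: str) -> str:
        # Simulate a tool call that fails on the first attempt only.
        self.create_attempts += 1
        if self.create_attempts == 1:
            raise RuntimeError("create_list failed")
        self.lists[name] = []
        return name

    def add_card(self, list_name: str, title: str) -> None:
        self.lists[list_name].append(title)


def search_web(query: str) -> list[str]:
    # Placeholder results; a real agent would call a search tool here.
    return [f"{query}: result {i}" for i in range(1, 4)]


def run_task(board: FakeBoard, max_attempts: int = 2) -> dict:
    """Run the three steps in order, retrying list creation if it fails."""
    list_name = None
    for attempt in range(1, max_attempts + 1):
        try:
            list_name = board.create_list("Amazon news")
            break
        except RuntimeError:
            continue  # adjust the plan: here, simply try again
    if list_name is None:
        return {"ok": False, "attempts": max_attempts, "cards": 0}

    for title in search_web("Amazon news"):
        board.add_card(list_name, title)  # every search result becomes a card
    return {"ok": True, "attempts": attempt, "cards": len(board.lists[list_name])}
```

Calling `run_task(FakeBoard())` walks through the same shape as the test: one failed attempt, a successful second one, then one card per search result.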

Aggregated results (multiple runs):

🟢 GPT 4.1: A new list created on 2nd attempt

🟢 GPT 4.1-mini: A new list on 2nd attempt

🟡 GPT o3-mini: A new list on 2nd attempt, doesn't create all cards

🟢 Grok 4: A new list on 2nd attempt

🔴 Gemini 2.5 Flash: Fails 90%+ of the time, doesn't try to adjust the plan

🟡 Gemini 2.5 Pro: A new list on 2nd attempt, duplicates lists

🔴 Kimi K2: Fails 90%+ of the time, doesn't try to adjust the plan

🟡 Claude Sonnet 4: A new list on 2nd attempt, doesn't create all cards

🔴 DeepSeek R1: Fails 90%+ of the time, doesn't try to adjust the plan

Of course, it was just a single task.

But the results align well with what I've been observing in recent months.

Best practices:

  • Use non-reasoning LLMs for planning steps by default

  • Provide a scratchpad for the AI agent to make notes

  • Or use a solution like Sequential Thinking MCP to control the process
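A scratchpad can be as simple as a list of notes that gets re-sent with every prompt. A minimal sketch, assuming a plain-text prompt format; the `Scratchpad` class and `build_prompt` helper are illustrative names, not part of any specific framework or of Sequential Thinking MCP.

```python
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    """Notes the agent keeps between tool calls (illustrative structure)."""
    notes: list[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)

    def render(self) -> str:
        if not self.notes:
            return "Scratchpad: (empty)"
        lines = [f"{i}. {note}" for i, note in enumerate(self.notes, start=1)]
        return "Scratchpad:\n" + "\n".join(lines)


def build_prompt(task: str, pad: Scratchpad) -> str:
    # The notes are included on every turn so the model sees what it already did.
    return f"Task: {task}\n\n{pad.render()}\n\nDecide the next step."
```

After each tool call, the agent loop would call `pad.add(...)` with a short note (for example, that the list was created), then rebuild the prompt with `build_prompt`.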

Observations:

  • 4.1 > 4.1-mini > Claude Sonnet 4 > Other models

  • 4.1-mini is 83% cheaper than 4.1 but performs similarly in many cases

  • Above 2-3 tool calls, every model makes mistakes

  • If the process requires more than 5 tools and dynamic planning, a single AI agent will most likely fail. No matter the model. You need to clearly separate the roles and move the logic to the orchestration layer.
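To make "move the logic to the orchestration layer" concrete, here is a rough sketch in which the order of steps and the retries live in ordinary code, and each role does one narrow job. All names (`list_creator`, `researcher`, `card_writer`, `orchestrate`) are hypothetical, and the roles return placeholder data where a real system would make an LLM call with a small tool set.

```python
from typing import Callable


# Each role is a plain function with one narrow job (placeholder logic only).
def list_creator(state: dict) -> dict:
    state["list_name"] = "Amazon news"
    return state


def researcher(state: dict) -> dict:
    state["results"] = [f"headline {i}" for i in range(1, 4)]
    return state


def card_writer(state: dict) -> dict:
    state["cards"] = [f"{state['list_name']} | {r}" for r in state["results"]]
    return state


# The order of steps is fixed here, in code, not decided by a model.
PIPELINE: list[tuple[str, Callable[[dict], dict]]] = [
    ("create_list", list_creator),
    ("research", researcher),
    ("write_cards", card_writer),
]


def orchestrate(pipeline, max_retries: int = 1) -> dict:
    """Run each step in order; retry a failing step, then give up loudly."""
    state: dict = {"log": []}
    for name, step in pipeline:
        for _ in range(max_retries + 1):
            try:
                state = step(state)
                state["log"].append(f"{name}: ok")
                break
            except Exception as exc:
                state["log"].append(f"{name}: failed ({exc})")
        else:
            raise RuntimeError(f"step {name!r} failed after retries")
    return state
```

With this split, no single role needs more than one or two tools, and a failure is handled by the orchestrator instead of by the model's own replanning.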

Bottom line:

We can create great agentic workflows and multi-agent systems. They follow well-defined processes powered by LLMs, like my multi-agent research system (a clone of Anthropic's): productcompass.pm/p/mul…

Possible: LLM-powered process automation

Doesn't exist yet: Fully "agentic AI"

Essential: Orchestration

An extra video (one of my tests) is in the comments.

Hope that helps.

P.S. Find this helpful?

You might enjoy 14 Principles of Building AI Agents: productcompass.pm/p/bui…

Jul 25 at 3:37 AM