
I hit a frustrating wall when I first started building AI agents.

But the lessons I learned now define how I build enterprise-grade apps.

Let me explain...

Early on, I was extremely comfortable manipulating text.

But the moment I tried to add PDFs, images, and audio, my clean architectures collapsed.

I built pipelines that chained together:

• OCR engines for PDFs

• Layout detection for tables and diagrams

• Custom classifiers for images

It looked sophisticated.

But it behaved like a brittle machine, breaking every time a document's layout changed.

The breakthrough came when I realised I was solving the wrong problem.

I did not need to convert documents to text...

I just needed to treat them as images.

... And let multimodal LLMs handle them natively.

Once I understood that:

• Every PDF page is effectively an image

• Modern LLMs can “see” as well as they can read

• Images, audio, and text all become tokens

The entire system simplified.

And the accuracy increased.
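The pages-as-images idea fits in a few lines of code. A minimal sketch, assuming an OpenAI-style chat message format and that each PDF page has already been rendered to PNG bytes upstream (PyMuPDF and pdf2image both do this); the function name is mine:

```python
import base64


def page_as_message(page_png: bytes, question: str) -> dict:
    """Wrap a rendered PDF page (PNG bytes) and a text prompt into a
    single multimodal chat message. No OCR, no layout detection: the
    model sees the page exactly as a human would."""
    encoded = base64.b64encode(page_png).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{encoded}"},
            },
        ],
    }


# Illustrative call with placeholder bytes; in practice, render the
# real page first and send the message to a multimodal LLM endpoint.
msg = page_as_message(b"\x89PNG...", "Summarise the table on this page.")
```

The whole OCR-plus-layout-detection pipeline collapses into one message: tables, diagrams, and handwriting all travel inside the image, so nothing is lost to a text conversion step.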

Here's the gist:

If you keep normalising everything to text, you are throwing away the information that matters most.

In the next installment of the AI Agents Foundations series in Decoding AI Magazine, I break down how to build agents that work with this reality instead of fighting it.

Here is what I will walk you through:

• Foundations of multimodal LLMs

• Practical implementation

• Multimodal state management

• Building the agent

If you want this lesson in your inbox the moment it goes live, subscribe to Decoding AI Magazine.

Link: decodingai.com

Dec 8, 2:37 PM