I hit a frustrating wall when I first started building AI agents.
But the lessons I learned now define how I build enterprise-grade apps.
Let me explain...
Early on, I was extremely comfortable manipulating text.
But the moment I tried to add PDFs, images, and audio, my clean architectures collapsed.
I built pipelines that chained together:
• OCR engines for PDFs
• Layout detection for tables and diagrams
• Custom classifiers for images
It looked sophisticated.
But it behaved like a brittle machine that broke every time a document layout changed.
The breakthrough came when I realised I was solving the wrong problem.
I did not need to convert documents to text...
I just needed to treat them as images.
... And let multimodal LLMs handle them natively.
Once I understood that:
• Every PDF page is effectively an image
• Modern LLMs can “see” as well as they can read
• Images, audio, and text all become tokens
The entire system simplified.
And the accuracy increased.
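To make the "everything becomes tokens" idea concrete, here is a minimal sketch of sending a rendered page image straight to a multimodal model. It assumes an OpenAI-style chat message format with base64 data URLs; the PDF-to-PNG rendering step (e.g. via pypdfium2 or pdf2image) and the API call itself are left out, and the function name is my own.

```python
import base64

def page_image_message(image_bytes: bytes, prompt: str) -> dict:
    """Wrap a rendered PDF page (PNG bytes) as a multimodal chat message.

    Assumes an OpenAI-style content array; adjust the shape for
    your provider's API.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }

# Usage: render each page to PNG, then send the messages to the model.
msg = page_image_message(b"\x89PNG...", "Extract the invoice total.")
```

No OCR, no layout detection: the model reads the page the way a human would.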
Here's the gist:
If you keep normalising everything to text, you are throwing away the information that matters most.
In the next installment of the AI Agents Foundations series in Decoding AI Magazine, I break down how to build agents that work with this reality instead of fighting it.
Here is what I will walk you through:
• Foundations of multimodal LLMs
• Practical implementation
• Multimodal state management
• Building the agent
If you want this lesson in your inbox the moment it goes live, subscribe to Decoding AI Magazine.
Link: decodingai.com