OpenAI just announced o3-pro. It is the best model in the world. But it's also… too good for me to tell how good it is. This will become more common, and it's a huge problem.
The benchmarks are clear, and so are the evaluations of human testers: o3-pro is significantly better than o3. But do they know why? Can a normal person tell the two apart? I believe the answer is no.
You need to be a world-class expert in math, code, science, etc. to find a question o3-pro can answer that o3 can’t.
Ben Hylak, previously at SpaceX and Apple and now running an AI company, said he had a hard time coming up with a question or task complex enough to reveal just how smart o3-pro is.
The model needed more context, he says. "It's hard to capture in an eval."
Judging by his reaction, I can tell he doesn't expect to be able to distinguish the best model from the second-best for much longer.
What can we do as normal people with normal skills? Curate our trust. Once again, we are left delegating to others the task of getting a state-of-the-art picture of AI's skills.
Crucially, we shouldn’t forget that the intelligence of AI models is jagged: they can achieve spectacular feats in mathematics yet fail at simpler puzzles, like those in ARC-AGI-2.
Why is that?
Because the models are smart, but not always in ways that feel familiar. And it’s the weirdness of the flaws, more than the flaws themselves, that’s going to catch us off guard.
If we don’t learn to navigate this, we’ll split into two camps:
The tiny slice of experts—and those of us who trust them—awed by o3-pro’s strange brilliance.
And those still clapping for Apple’s “The Illusion of Thinking,” happy to believe that AI is, of course, still dumb.
Which camp do you want to belong to?