OpenAI’s o3 model has a secret ability: it can play GeoGuessr (the game of locating Google Street View-like pictures on a map) even better than human experts. But it only does so if you ask with the right prompt. A very long prompt gets the best performance, whereas simply asking isn’t enough.
Why does this happen? How is it possible that crafting a ~1,000-word prompt is sufficient (and perhaps necessary) for o3 to display such mastery of a game no one trained it to play? This question reminded me of an insight I first read from Gwern: “sampling can prove the presence of knowledge but not the absence.”
It’s not just GeoGuessr. There’s a non-negligible probability that AI models are much more competent (at some tasks) than their benchmark performance lets us see, because we are very bad at sampling: we are bad at coming up with the right prompts to wake up their latent abilities.
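To make the sampling point concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the prompt names and accuracy numbers are invented, not measurements of o3 or any other model, and `elicited_capability` is a hypothetical helper. It only shows the logic of the argument: the best score we observe over the prompts we happened to try is a lower bound on what the model can do.

```python
from typing import Dict

def elicited_capability(scores_by_prompt: Dict[str, float]) -> float:
    """Return the best score observed across a set of prompts."""
    return max(scores_by_prompt.values())

# Hypothetical accuracies of one fixed model under three prompts.
# These numbers are made up for illustration only.
all_prompts = {
    "bare question": 0.40,
    "short hint": 0.55,
    "long expert prompt": 0.90,
}

# A benchmark that only tries the first two prompts.
tried = {name: all_prompts[name] for name in ("bare question", "short hint")}

observed = elicited_capability(tried)       # what the benchmark reports
latent = elicited_capability(all_prompts)   # what the model could actually do

# Trying more prompts can only raise the observed maximum, never lower it.
assert observed <= latent
print(f"observed: {observed:.2f}, latent: {latent:.2f}")
```

A high score under some prompt proves the ability is there; a low score under the prompts we tried proves nothing about the prompts we didn’t.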