Are humans that much better at text games than LLMs? "I organized an hour long event at Recurse Center, promised delicious donuts to everyone who came, and had them setup zork-bench on their laptops and play in human-eval model. The game logs all of their interactions using the same interface as LLMs but gives them a random label. The thing is Humans new to the game seem to do only so well. They spend a lot of turns, play the game, and figure some stuff out, but then after the hour of playing I gave them, they didn’t get further than any LLM. However, their memories of the game persist without continuously reducing the size of their context windows. Haha. Do humans have context windows? But the point is that LLMs, having humanity’s entire knowledge of Zork stored in their memory banks, are unable to outperform humans who had not played Zork before (except for Claude Sonnet, which Isha Bhand, creator of fomo.nyc and Zork aficionado, declared as evidence that AGI has been achieved)."