gwern (@gwern): "Are humans that much better at text games than LLMs? "I organized an hour long event at Recurse Center, promised delicious donuts to everyone who came, and had them setup zork-bench on their laptops and play in human-eval model. The game logs all of their interactions using the…"

Are humans that much better at text games than LLMs? "I organized an hour-long event at Recurse Center, promised delicious donuts to everyone who came, and had them set up zork-bench on their laptops and play in human-eval model. The game logs all of their interactions using the same interface as LLMs but gives them a random label. The thing is, humans new to the game seem to do only so well. They spend a lot of turns, play the game, and figure some stuff out, but then after the hour of playing I gave them, they didn’t get further than any LLM. However, their memories of the game persist without continuously reducing the size of their context windows. Haha. Do humans have context windows? But the point is that LLMs, having humanity’s entire knowledge of Zork stored in their memory banks, are unable to outperform humans who had not played Zork before (except for Claude Sonnet, which Isha Bhand, creator of fomo.nyc and Zork aficionado, declared as evidence that AGI has been achieved)."

zork-bench: An LLM reasoning eval based on text adventure games

Growing up in the 90s, I would go to the library and find books on computers. Most of these books were already out of date, containing printed Apple BASIC programs that you could try to copy in and get to work. My favorite one was an F-14 Tomcat simulator. I never got this to work. At the time my eight-year-old brain didn’t conceive that Applesoft BASIC …

Low Impact Fruit

fomo.nyc

Fomo NYC

Jun 4

4:58 AM
