Fei-Fei Li and the World Labs team just published a nice taxonomy of world models: drfeifei.substack.com/p…. It sorts the many things now called "world models" into three functions: a renderer that outputs pixels, a simulator that outputs faithful geometry and physics, and a planner that outputs actions. It's a genuinely useful piece for a field where one phrase is being stretched across video models, physics engines, and robot controllers all at once.
But what if we don't need a model of the world at all? Generative models don't contain anything like a simulator or an inner scene. They encode the sequential structure of the signal, and that is sufficient to "continue" any sequence. Language, music, video: in each case, the generators demonstrate that learning to continue sequences buys you the behavior (syntax, melody, "physics") without the need to model anything explicitly. There is just the structure of the data; no causal inference to a latent model.
Fei-Fei says: "The world is not made of words." True. But the generative approach isn't confined to words. It works on the physical world too. Autoregressive generators produce sequences where objects fall, collide, and occlude one another, and where bodies, including human ones, move with subtle, accurate mechanics. None of this comes from a physics engine running underneath. No rigid bodies, no collision solver, no gravity constant, no skeletal rig, no underlying objects at all in any explicit sense. The model predicts the next frame from the frames before it, and the physics shows up because physics already shaped the data it learned from. Now build this into a closed loop, where the next step generated is an action and its consequence becomes the next input, and continuation alone starts to look like an agent moving through a world.
The world enforces its constraints on the sensory data. Generative models show that you don't need to recover those constraints themselves; you only need the fingerprints they leave on the data.
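To make the "fingerprints" idea concrete, here is a toy sketch, not how any real video model works. Everything in it is an assumption made for illustration: the one-dimensional "frames" (just the height of a dropped ball), the made-up gravity of one unit per step, and the hypothetical helper names `simulate_drop`, `fit_continuer`, and `continue_sequence`. An explicit physics rule is used only to produce the data; the continuer is nothing more than a lookup table of which frame followed which two frames.

```python
from collections import Counter, defaultdict


def simulate_drop(start_height, steps):
    """Explicit toy physics, used ONLY to generate training data."""
    y, v = start_height, 0
    frames = [y]
    for _ in range(steps):
        v -= 1  # assumed gravity: 1 unit per step, chosen for illustration
        y += v
        if y <= 0:  # the ball hits the floor and stays there
            y, v = 0, 0
        frames.append(y)
    return frames


def fit_continuer(sequences, context=2):
    """Count which frame followed each run of `context` frames."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - context):
            key = tuple(seq[i:i + context])
            counts[key][seq[i + context]] += 1
    return counts


def continue_sequence(counts, seed, steps, context=2):
    """Extend `seed` by repeatedly appending the most frequent next frame."""
    out = list(seed)
    for _ in range(steps):
        key = tuple(out[-context:])
        if key not in counts:  # unseen context: the table has nothing to say
            break
        out.append(counts[key].most_common(1)[0][0])
    return out


# Training data: drops from heights 3 through 20.
training = [simulate_drop(h, 12) for h in range(3, 21)]
model = fit_continuer(training)

# Continue from two seed frames. No velocity, gravity, or floor appears here.
generated = continue_sequence(model, seed=[10, 9], steps=6)
print(generated)  # [10, 9, 7, 4, 0, 0, 0, 0]
```

The continuation accelerates downward and stops at the floor, matching `simulate_drop(10, 7)`, even though the table stores no velocity and no gravity constant. The obvious caveat is that a lookup table only memorizes contexts it has seen and stops on anything new, so this illustrates the claim rather than proving it.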
So the question the taxonomy never quite asks: are world models necessary, for AI or for us? Maybe AI doesn't need them, and maybe we don't either.
I made a fuller case here: elanbarenholtz.substack….