Meta's V-JEPA 2 was pretrained on a million hours of internet video. No robot data at all.
After fine-tuning on sixty-two hours of robot trajectories from an open-source dataset, the action-conditioned version was deployed zero-shot on Franka arms in two different labs. Neither lab appeared in the training data. The robots could pick and place objects by sampling candidate actions, predicting consequences through the world model, and selecting the action sequence whose predicted future best matched the goal.
The model never saw those labs during training. It planned through its predictive world model instead.
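That sample, predict, select loop is easy to sketch. What follows is a minimal illustration, not Meta's code: `predict_next` is a hypothetical toy stand-in for a learned world model (it just adds the action to the state), the state is a 2-D point rather than a video embedding, and the search is plain random shooting, which may differ from the optimizer V-JEPA 2 actually uses. The names `plan`, `rollout`, and `goal_distance`, and all parameter values, are assumptions made for this example.

```python
import random


def predict_next(state, action):
    """Toy stand-in for a learned world model (assumption): next = state + action."""
    return tuple(s + a for s, a in zip(state, action))


def rollout(state, actions):
    """Imagine the future by applying the model once per action in the sequence."""
    for action in actions:
        state = predict_next(state, action)
    return state


def goal_distance(state, goal):
    """L1 distance between a predicted state and the goal state."""
    return sum(abs(s - g) for s, g in zip(state, goal))


def plan(state, goal, horizon=3, num_candidates=256, max_step=0.5, rng=None):
    """Random-shooting planner: sample action sequences, keep the best one.

    Returns the action sequence whose predicted final state is closest
    to the goal, together with that predicted distance.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        # Sample one candidate: `horizon` actions, each bounded by max_step.
        candidate = [
            tuple(rng.uniform(-max_step, max_step) for _ in state)
            for _ in range(horizon)
        ]
        # Score the candidate by where the model predicts it will end up.
        cost = goal_distance(rollout(state, candidate), goal)
        if cost < best_cost:
            best_actions, best_cost = candidate, cost
    return best_actions, best_cost


if __name__ == "__main__":
    start, goal = (0.0, 0.0), (1.0, -1.0)
    actions, cost = plan(start, goal)
    print("planned actions:", actions)
    print("predicted distance to goal:", round(cost, 3))
```

Nothing in `plan` is learned or tuned for a particular environment. All the knowledge lives in the model behind `predict_next`, which is why swapping in a better predictor changes what the agent can do without touching the planner.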
This is what world models contribute to embodied agents. Without one, the agent reacts. With one, the agent imagines.