This is also another reason to be excited about world models, which potentially can start to understand these concepts and learn to ground the various spatial relationships that we talk about in reality. World models take in and predict video data; because of this, they can actually learn an implicit model of the world, of how objects interact, and learn (implicitly) to understand the 3D space that humans and embodied agents must inhabit.