Vision-language-action models are doing what philosophy of mind specified in 1990.

Stevan Harnad named the symbol grounding problem that year. Classical AI systems shuffled symbols according to rules, but the symbols had no connection to the world outside the system. "Cat" meant whatever the other symbols said it meant. Harnad argued the regress had to bottom out somewhere, and he proposed the grounding had to come from sensorimotor experience.

The proposal sat for thirty years without an engineering answer. Statistical NLP succeeded without grounding. Multimodal models grounded text in images but not in action.

VLA models close the loop. In 2023, Google DeepMind trained RT-2, a system that takes camera input and a natural language instruction and outputs robot action tokens. The token "place" no longer derives meaning from other tokens. It derives meaning from the motor program that gets executed and the world-state change that results.
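
The action-token idea is concrete enough to sketch. Below is a minimal illustration of RT-2-style action tokenization, assuming the published scheme of discretizing each continuous action dimension into 256 uniform bins; the normalized range and the function names are illustrative, not RT-2's actual code.

```python
import numpy as np

# Minimal sketch of RT-2-style action tokenization. RT-2's published
# scheme discretizes each continuous action dimension into 256 uniform
# bins; the normalized range and function names here are illustrative.

NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> list[int]:
    """Map a continuous action vector to one token id per dimension."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return (scaled * (NUM_BINS - 1)).round().astype(int).tolist()

def tokens_to_action(tokens: list[int]) -> np.ndarray:
    """Invert the discretization: token ids back to bin values."""
    scaled = np.asarray(tokens, dtype=np.float64) / (NUM_BINS - 1)
    return scaled * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# A 7-DoF arm action round-trips through token space, accurate to
# within half a bin width (~0.004 over a [-1, 1] range):
a = np.array([0.12, -0.30, 0.05, 0.0, 0.0, 0.25, 1.0])
tokens = action_to_tokens(a)   # e.g. [143, 89, 134, 128, 128, 159, 255]
print(tokens_to_action(tokens))
```

The round trip is the point: an action token is not a label defined by other tokens, it is an index into a motor command the robot executes.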

OpenVLA and Figure's Helix encode the same architectural commitment. Language tokens and action tokens share an embedding space. Meaning is the regularity that maps from one to the other under sensory feedback.
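What "sharing an embedding space" amounts to, in a minimal sketch: extend one token table so that action bins are just ids past the text range. The sizes and the SharedTokenSpace class below are hypothetical; real systems reuse a pretrained VLM's vocabulary, and RT-2 in particular overwrites rarely used text tokens rather than appending new ids.

```python
import torch
import torch.nn as nn

# Minimal sketch of a shared token space for language and action.
# All sizes and names here are hypothetical.

TEXT_VOCAB = 32_000   # assumed language vocabulary size
ACTION_BINS = 256     # one id per discretized action bin
EMBED_DIM = 512       # illustrative model width

class SharedTokenSpace(nn.Module):
    def __init__(self):
        super().__init__()
        # One table for both modalities: an action bin is just an id
        # past the text range, so "place" and bin 141 are points in
        # the same space the transformer attends over.
        self.embed = nn.Embedding(TEXT_VOCAB + ACTION_BINS, EMBED_DIM)

    def action_bin_to_id(self, bin_index: int) -> int:
        return TEXT_VOCAB + bin_index

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)

space = SharedTokenSpace()
text_ids = torch.tensor([101, 2154])                     # hypothetical ids for "place"
action_ids = torch.tensor([space.action_bin_to_id(141)])
# Both lookups land in the same 512-dim space; training on mixed
# text-and-action sequences is what ties the two regions together.
print(space(text_ids).shape, space(action_ids).shape)
```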

Whether this solves the philosophical problem is a separate question. Harnad would likely say it depends on whether the grounding is constitutive or merely correlative. But the engineering is doing what the philosophy specified.

Robotics engineers may not know they are implementing Harnad. They are.
