Yeah, there’s no doubt that Physical Intelligence’s work is very impressive, but it isn’t possible to answer questions like “would equivalent performance have been possible, or would training have been easier, with a smaller generalist model that uses extra-visual sensory information?” because no comparable large-scale non-visual dataset was available for their pretraining scheme.

In terms of tasks, my intuition is that large-scale VLA success is built on the fact that visual and tactile data are “coupled through motion” in many scenarios: for example, you can infer the inertia of a knife from how it moves in response to your actions, and policy feedback can stabilize quasistatic motion. This is pure speculation, but that coupling may break down for highly dynamic motions, or for behaviors that involve large forces with very little motion.
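
To make the “coupled through motion” idea concrete, here is a minimal toy sketch (all quantities hypothetical): if a policy knows the forces implied by its own actions and can observe the resulting accelerations visually, an object’s effective mass is recoverable by a simple least-squares fit, with no tactile sensing at all.

```python
import numpy as np

# Toy illustration: force known from commanded actions, acceleration observed
# visually; estimate mass by minimizing ||F - m * a||^2 over m.
rng = np.random.default_rng(0)
true_mass = 0.35                                 # kg, unknown to the policy
applied_force = rng.uniform(0.5, 2.0, size=50)   # N, known from commanded actions
observed_accel = applied_force / true_mass + rng.normal(0, 0.2, size=50)  # from vision

# Closed-form least-squares solution: m = (a . F) / (a . a)
estimated_mass = np.dot(observed_accel, applied_force) / np.dot(observed_accel, observed_accel)
print(f"estimated mass: {estimated_mass:.3f} kg (true: {true_mass} kg)")
```

The point is only that in quasistatic, smoothly moving scenes, vision plus knowledge of your own actions already carries a lot of the information that touch would provide; that shortcut disappears when there is large force but little visible motion.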

My intuition is that tactile or force feedback could be incorporated into low-level impedance / preflex / reflex controllers coupled to the generalist model, though I know there are prominent opposing views that any hand-designed structure will eventually bottleneck performance.
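
As a rough sketch of what I mean (names, gains, and rates are all hypothetical, not anything Physical Intelligence has described): the generalist policy streams pose setpoints at a slow rate, while a fast impedance loop uses measured contact force to stay compliant and to trigger a reflex-style back-off, so the high-level model never has to react at contact timescales.

```python
import numpy as np

# Hypothetical 1-D impedance controller sitting below a generalist policy.
class ImpedanceController:
    def __init__(self, stiffness=200.0, damping=30.0, force_limit=15.0):
        self.k = stiffness              # N/m, virtual spring toward the policy setpoint
        self.d = damping                # N*s/m, virtual damper on measured velocity
        self.force_limit = force_limit  # N, reflex-style clamp on contact force

    def command(self, x, x_dot, x_ref, f_contact):
        # Spring-damper toward the setpoint requested by the high-level policy.
        f_cmd = self.k * (x_ref - x) - self.d * x_dot
        # Reflex: if contact force exceeds the limit, subtract the excess so the
        # slow vision-language-action loop never has to handle it in time.
        if abs(f_contact) > self.force_limit:
            f_cmd -= np.sign(f_contact) * (abs(f_contact) - self.force_limit)
        return f_cmd

# Usage: the generalist policy updates x_ref at ~10 Hz; this loop runs at ~1 kHz.
ctrl = ImpedanceController()
f = ctrl.command(x=0.10, x_dot=0.01, x_ref=0.12, f_contact=18.0)
print(f"commanded force: {f:.2f} N")
```

Whether this kind of fixed structure helps or hurts at scale is exactly the open question: it buys robustness at contact timescales, but it is also the sort of hand-designed prior that the end-to-end camp expects to become the bottleneck.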
