Humanoids, VLAs, and the Race to Build Robots That Can Do Anything
Authors: Jasper Platz and Neel Mehta | G2 Venture Partners
Today’s industrial robots are one-trick ponies. The hardware is complex yet dumb: it can execute precise movements at high speed, but it must be meticulously programmed for each specific task, which limits its flexibility. If you need a robot to do something else, you have to unbolt it, move it, and reprogram it. The hardware is expensive, but the custom software and implementation work is just as significant, usually making up 50–80% of total deployment costs. These high upfront costs and complex implementations severely restrict industrial automation: the vast majority of robots today sit in high-throughput, low-mix applications, where the upfront investment can be amortized quickly.
The holy grail of robotics is a combination of versatile, low-cost hardware and deployment so simple it’s like getting a new iPhone. Vision-language-action models (VLAs) might unlock this new future of robotics.
From LLMs to VLAs
Large language models (LLMs) are great at generating human-like text: ask them a question and they answer, drawing on vast amounts of text scraped from the internet. Vision-language-action models (VLAs) adapt this architecture to robotics. Trained on robot-specific data, they take in human commands, perceive their environment through cameras and other sensors, and, instead of generating words, produce instructions that let robots do useful things.
VLAs work something like this:
- Human prompts robot with a task
- VLA perceives the current physical position of the robot
- VLA reasons to generate the next step to accomplish the task
- VLA issues commands to the robot
- Steps 2–4 are repeated until the task is accomplished
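To make this loop concrete, here is a toy Python sketch of the perceive-reason-act cycle. The `ToyRobot` and `ToyVLA` classes are hypothetical stand-ins (a single joint driven toward a goal), not any real vendor's API; an actual VLA would consume camera frames plus the language instruction and emit short chunks of low-level actions.

```python
# A minimal, runnable sketch of the perceive-reason-act loop described above.
# ToyRobot and ToyVLA are illustrative stand-ins, not a real robotics API.

class ToyRobot:
    """Stand-in for robot hardware: a single joint we can read and move."""
    def __init__(self):
        self.position = 0.0

    def observe(self) -> float:
        return self.position          # Step 2: perceive the current physical state

    def apply(self, delta: float):
        self.position += delta        # Step 4: execute the low-level command


class ToyVLA:
    """Stand-in for a vision-language-action model."""
    def next_action(self, instruction: str, observation: float) -> float:
        # Step 3: "reason" about the next small action toward the goal.
        # A real VLA would decode this from images + language rather than
        # a hard-coded target.
        target = 1.0 if "open" in instruction else 0.0
        error = target - observation
        return max(-0.1, min(0.1, error))   # emit a small, bounded motion


def run_task(robot: ToyRobot, vla: ToyVLA, instruction: str, max_steps: int = 100):
    """Step 1: a human prompts the robot with a task in natural language."""
    for _ in range(max_steps):
        obs = robot.observe()
        action = vla.next_action(instruction, obs)
        robot.apply(action)
        if abs(action) < 1e-3:        # steps 2-4 repeat until the task is done
            break
    return robot.position


if __name__ == "__main__":
    print(run_task(ToyRobot(), ToyVLA(), "open the gripper"))  # -> ~1.0
```

The structure, not the toy math, is the point: the model is queried every control step with fresh observations, so the robot can recover from disturbances instead of replaying a fixed motion script.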
Comparing this to the old way of custom-programming a robot, a couple of things jump out:
- We can instruct the robot in human language, not custom software code.
- The VLA reasons through a complex task “chain of thought”-style and breaks it into small actions a robot can execute (e.g., move an arm a few inches to a new position), eliminating the need to spell out every little detail of a complex sequence of motions.
VLAs may eliminate the need for costly, time-consuming custom software and implementation work — one of the biggest hurdles in deploying traditional robots.
The Path to Humanoids and General Physical AI
How far away are we from General Physical AI and embodied form factors like humanoids that can pick up any task in our homes or industrial settings?
On the one hand, real commercial progress has been disappointingly slow. Boston Dynamics has been churning out impressive demos for over a decade, and Honda had a walking robot called ASIMO 25 years ago. Yet commercial deployments are stuck in pilots.
But the last 12 months feel different. The speed of development seems to be accelerating. A wave of Physical AI startups is emerging, each racing to define the future of robotics. Many big tech companies have announced or are rumored to be working on humanoid projects. We’re seeing incredible demos by all the leading companies, including Figure.ai, Agility, Boston Dynamics, Tesla and Mentee Robotics.
So how do we assess real progress towards General Physical AI? Here are the main challenges and what to look out for:
1 — Task Generalization and Training Data
VLAs are at a disadvantage compared to LLMs when it comes to training data. LLMs can be trained on vast datasets scraped from the internet, while robotic training data is much harder to obtain. Open-source datasets are limited, and the signal in whatever data is available can be sparse. Teleoperating robots has been the next best alternative, but it’s expensive and slow. Many believe simulated training data will be the answer, and companies like Skild.ai are working on scalable training data, but simulated data is unproven in delivering the edge-case depth and quality needed. Plus, there is the unsolved problem of “sim2real” differences: the real world is complex and hard to model perfectly. For example, in a simulation, pushing a soda can with your finger always ends with the can in the same place; in reality, a bit of dirt under the can could change the outcome every time.
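To illustrate how sensitive even a “simple” interaction is to unmodeled physics, here is a back-of-the-envelope sketch of that soda-can example. The numbers (push speed, friction coefficients) are made up purely for illustration: a small change in friction, like a bit of dirt under the can, noticeably shifts where the can stops.

```python
# Back-of-the-envelope illustration of the sim2real gap for the soda-can push.
# All numbers are illustrative assumptions, not measured values.

G = 9.81  # gravitational acceleration, m/s^2

def sliding_distance(push_speed: float, friction_coeff: float) -> float:
    """Distance a pushed can slides before kinetic friction stops it: v^2 / (2*mu*g)."""
    return push_speed ** 2 / (2 * friction_coeff * G)

push_speed = 0.5        # m/s, speed the finger imparts to the can
mu_simulated = 0.30     # friction coefficient assumed in the simulator
mu_real = 0.45          # real-world friction with a bit of dirt under the can

d_sim = sliding_distance(push_speed, mu_simulated)
d_real = sliding_distance(push_speed, mu_real)

print(f"simulated stop distance:  {d_sim * 100:.1f} cm")   # ~4.2 cm
print(f"real-world stop distance: {d_real * 100:.1f} cm")  # ~2.8 cm
# A ~1.4 cm discrepancy is enough to miss a grasp or tip the can,
# and the simulator never exposed the policy to that variation.
```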
The jury is still out on how much training data is really needed to get to General Physical AI. If a robot can put away groceries in some kitchens it was trained on, does that mean it can do so in other kitchens? Can a robot that can paint a wall also paint a cabinet? The list goes on.
How well skills learned from training data generalize to other related or unrelated tasks is a key factor to keep an eye on. There seems to be no silver bullet today.
2 — Hardware is Hard
Building new hardware products is notoriously difficult. Development cycles are long, and mistakes are expensive.
Untethered robots like humanoids must solve for battery life and for enough onboard compute to run intensive AI models. To truly mimic humans, humanoid robots must achieve high degrees of freedom and remarkable dexterity, and every moving part adds complexity and a potential point of failure.
All that complexity casts a shadow on the claim that humanoids are the only form factor for all industrial applications. It comes down to the tradeoff between versatility (doing what humans can do), specialization (hardware optimized for specific tasks), and cost. In a recent conversation, a major industrial buyer of robotics emphasized that they care little about the form factor or how general-purpose a robot is; what matters is its ability to perform a given task effectively at the lowest cost per task. The punch line is: we don’t think vertical solutions are going away.
Lastly, grasping is a massive challenge in robotics. Human hands are a marvel of evolution. Robots struggle with many real-world manipulation problems, like grasping soft items with just enough force to avoid crushing them, or handling simple tools like a knife or a tape roll for precise tasks (e.g., opening and closing a shipping box). The fact that various startups focus solely on robotic hands illustrates how big a challenge manipulation and grasping represent in robotics.
There’s a prevailing sentiment that all the value in robotics lies in intelligence and foundation models, but hardware remains a crucial differentiator. Consider the sheer complexity and number of engineering decisions that go into building a humanoid — battery sizing, compute power, type and number of actuators, motors, sensors, and manipulators. Each of these choices has profound implications for performance, reliability, and cost, making hardware a defining factor in robotics development.
3 — Reliability and Safety
It’s been 20 years since the original DARPA Grand Challenge for self-driving vehicles and 10 years since the first autonomous cross-country drive from San Francisco to New York. Yet only recently has Waymo managed to generate meaningful revenue from its robo-taxi service. The road to autonomy is littered with failed startups and huge R&D spend with no revenue to show for it.
Getting from an awe-inspiring demo to a commercially viable product is a huge challenge in robotics. Failures can injure or kill people. Self-driving cars can run over pedestrians. Humanoids can lose their balance and crush someone. A robotic arm can knock someone over. And even if injuries can be avoided, reliability is a major concern for commercial customers. Who fixes the bot if it breaks? What are the resulting production losses and costs?
There is a long list of OSHA and ISO/ANSI safety standards that govern the use of robots in the workplace. Most robots today are locked inside safety cells to prevent injuries, which significantly adds to deployment costs and requires redesigning production facilities. For humanoids to work alongside humans, companies will have to build robust safety cases, including stochastic models of fall trajectories and traceability of actions. How do you troubleshoot a failure triggered by a black-box AI model? We’re just at the beginning of addressing these problems.
Final thoughts
The road to General Physical AI is filled with challenges, but the momentum is undeniable and the pace of improvement often mind-boggling. With rapid advancements in AI models, improving training data, and evolving hardware, we are getting closer to breaking through the barriers that have long constrained robotics. The next few years will determine whether humanoids can move beyond impressive demos to become truly useful, safe, and cost-effective. If breakthroughs continue at their current pace, we may soon see robots not just in controlled factory settings but seamlessly integrated into everyday life.