The robotics industry has a consensus problem. Nearly every startup and major research lab training robots to handle objects has converged on the same method: vision-language-action models that rely primarily on visual data to infer physical dynamics, betting that enough demonstrations and computing power will fill in the gaps.
Sze Yuan Cheong, co-founder and CEO of Devol Robots, argues that bet is unlikely to hold up reliably in production settings.
The result is systems that look capable in polished demos but stall out in real factories and warehouses, where conditions change constantly, and the margin for error is close to zero.
Cheong, a serial entrepreneur who spent more than a decade in manufacturing and industrial engineering before moving into robotics, believes the field needs to rebuild from the ground up. Through Devol Robots, he is building an alternative: a physical AI model that learns by feeling force and resistance, not just by watching.

Why Standard Training Models Fail In Real-World Settings
Across the robotics industry, startups and major labs have rallied around one training method: the vision-language-action model. A VLA system is trained on large volumes of demonstrations and visual data, mapping what it sees to the next action while inferring physical dynamics implicitly. It never directly measures how hard a gripper is squeezing or how much resistance a surface offers. The assumption is that with enough footage and enough computing power, the network will figure out the physics on its own.
For Sze Yuan Cheong, the core issue is that these systems do not explicitly model how a robot touches and handles objects. Instead, they ask vision to infer contact and force indirectly. These systems look impressive in controlled demos, but in messy, unpredictable real-world settings like factory floors or warehouses, sustaining production-level reliability becomes far more difficult.
A deeper bias in the field compounds the problem. Tasks a regular person would find effortless, like picking up a tilted lens and sliding it into a snug holder, are among the hardest for a robot. People rely on constant feedback from their fingers and muscles without thinking about it. A camera-only system has no equivalent sense of touch, and no amount of video can substitute for it.
“People try to solve it top-down, but they don’t care about the underlying layer,” he explains. “They think with enough data and compute, you just trust the network will solve everything underneath. We don’t think it will work.”

The Consequences: More Humans In The Mix
The failures that come with robotic malfunction, Cheong believes, aren't something engineers can tune their way out of. He argues the limitation is structural.
Industrial settings like assembly lines, logistics hubs, and retail fulfilment centers operate under strict constraints, with fixed cycle times and little tolerance for failure, so every robot installed on the floor has to justify its cost in hard returns.
When VLA models hit situations they weren't trained on (and in unstructured environments, those situations arise often), the standard fix is to put a human back in the loop: an operator dons a VR headset and takes over the robot remotely. In areas like e-commerce fulfilment and precision manufacturing, that handoff can happen every other minute if the robots aren't well-trained. The supposed autonomy becomes a staffing problem with extra steps.
What's worse, teleoperation itself is a poor substitute for tasks that demand a delicate touch. The operator can't feel what the robot is touching through a headset, so when the job involves placing a high-value lens into a tight fixture, repeated blind attempts risk damaging expensive products.
In Cheong’s view, this heavy reliance on teleoperation suggests the underlying control problem remains unsolved.
Devol Robots’ Solution
Cheong’s central thesis is simple: a robot can’t learn to handle the mechanics of a physical environment just by watching it. It needs to feel force and resistance like a person. Devol Robots applies that principle to machines.
The company's model fuses torque and force data from each of the robot's joints with stiffness parameters and vision, with the intent of creating a physics-grounded representation of how the robot's entire body handles a variety of objects and surfaces. Rather than relying purely on visual correlations, the model tracks how forces and motion evolve over time using a sequence-based neural architecture.
The difference here can be major, as it changes what the system actually understands. A conventional model learns to reproduce demonstrated trajectories from observation. Devol Robots’ model learns why that path was chosen: whether it was physically possible, whether it was the best option, and what made it better than the alternatives.
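To make the idea concrete, here is a minimal sketch of what fusing per-joint torque readings with visual features over a time sequence could look like. Everything here is a hypothetical illustration under assumed interfaces (the `Timestep` structure, the stiffness scaling, and the simple blending rule are inventions for exposition), not Devol Robots' actual architecture:

```python
# Hypothetical sketch: fusing proprioceptive (torque) and visual signals
# per timestep, then propagating a state across the sequence so the
# result reflects how forces evolve over time, not a single frame.
# None of this reflects Devol Robots' real model.

from dataclasses import dataclass
from typing import List

@dataclass
class Timestep:
    joint_torques: List[float]    # measured torque per joint (N*m)
    visual_features: List[float]  # e.g. an embedding from a vision encoder

def fuse(step: Timestep, stiffness: List[float]) -> List[float]:
    # Concatenate proprioceptive and visual channels into one state
    # vector, scaling each torque by a per-joint stiffness parameter
    # (a stand-in for the stiffness terms the article mentions).
    scaled = [t * k for t, k in zip(step.joint_torques, stiffness)]
    return scaled + step.visual_features

def run_sequence(steps: List[Timestep], stiffness: List[float],
                 alpha: float = 0.5) -> List[List[float]]:
    # A toy recurrent pass: each fused state is blended with the
    # previous hidden state, so downstream layers would see force
    # trends (e.g. rising resistance on contact) rather than
    # instantaneous readings. A real system would use a learned
    # sequence model in place of this fixed moving average.
    hidden = None
    out = []
    for step in steps:
        x = fuse(step, stiffness)
        if hidden is None:
            hidden = x
        else:
            hidden = [alpha * h + (1 - alpha) * v
                      for h, v in zip(hidden, x)]
        out.append(list(hidden))
    return out
```

The point of the sketch is the data flow, not the math: force enters the state explicitly rather than being inferred from pixels, which is the distinction Cheong draws against vision-only VLA systems.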
“We’re letting the robot experience the real world and learn, with the outcome linked back to the control and experience,” Cheong explains. “We don’t learn purely by looking at other people. If you’re told ‘do this,’ you won’t be able to learn; you need to try and feel by yourself. This is our thesis.”

What Comes Next
The prevailing strategy in robot AI starts from the top: solve the high-level reasoning first (what the robot should do and why) and trust that enough data will sort out the messy physical details underneath. Cheong argues the field has it backwards. The smarter path, he believes, is to start from the bottom: build a base layer grounded in real physics, then grow upward toward reasoning.
Devol Robots, now approaching three years since incorporation, fields an 18-person team and is deploying its model with U.S.-based industrial clients. The company plans to publish a white paper and peer-reviewed research that Cheong believes will challenge prevailing assumptions and offer a concrete alternative.
The race to deploy physical AI is accelerating, and professionals like Sze Yuan Cheong are betting that the winners will be the ones who got the physics right from the start.

