Keywords: Robotics, manipulation
TL;DR: A novel framework that introduces an online intervention mechanism to correct the base VLA policy.
Abstract: Large-scale Vision-Language-Action (VLA) models excel at mapping natural language instructions to robotic actions. However, they typically treat actions as terminal outputs, and imitation learning often leads to execution bias, leaving no mechanism for dynamic supervision or online error correction. Meanwhile, world models (WMs) have shown promise for predictive reasoning, but prior approaches typically require continuous frame-by-frame rollout of long sequences, resulting in high computational cost and limited flexibility. In this work, we propose VLA-in-the-Loop, a novel framework that introduces an online intervention mechanism to correct the base VLA policy. Our core innovation lies in using a lightweight, composite world model not for continuous state prediction, but as an on-demand, event-triggered “corrector.” When the VLA proposes a high-stakes action (e.g., closing the gripper), our composite WM first employs its discriminative component to evaluate the action’s feasibility at this critical juncture. Should the proposed action be deemed unviable, a generative model synthesizes a short video of a successful future trajectory from the current state. The robot is then guided to the correct position using actions decoded by an inverse dynamics model (IDM) and executes a corrected, more robust action. This plug-in architecture is not only computationally efficient but also improves data utilization by learning from potential failures, thereby significantly improving the robustness of VLA models against online disturbances. We validate our framework on multiple robotic grasping tasks in both simulation and real-world systems, demonstrating the effectiveness of using world models not only for prediction, but as active agents for real-time correction in VLA-based robotic systems.
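The sketch below illustrates one plausible reading of the event-triggered correction loop described in the abstract; it is a minimal sketch under stated assumptions, not the authors' implementation, and all names (VLAPolicy, CompositeWorldModel, Robot, and their methods) are hypothetical placeholders.

```python
# Hypothetical sketch of the event-triggered correction loop.
# All classes and method signatures here are illustrative placeholders,
# not the paper's released API.

from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    delta: tuple          # end-effector displacement (placeholder)
    close_gripper: bool   # "high-stakes" event that triggers the world model


class VLAPolicy:
    def propose(self, obs, instruction) -> Action:
        # Stand-in for the base VLA policy's forward pass.
        return Action(delta=(0.0, 0.0, -0.01), close_gripper=True)


class CompositeWorldModel:
    def is_feasible(self, obs, action) -> bool:
        # Discriminative component: score the proposed high-stakes action.
        return False  # pretend the grasp would fail, forcing a correction

    def synthesize_success_video(self, obs) -> List:
        # Generative component: short clip of a successful future trajectory.
        return ["frame_0", "frame_1", "frame_2"]

    def decode_actions(self, video) -> List[Action]:
        # Inverse dynamics model (IDM): recover the actions between frames.
        return [Action(delta=(0.005, 0.0, 0.0), close_gripper=False)
                for _ in range(len(video) - 1)]


class Robot:
    def observe(self):
        return "obs"

    def execute(self, action: Action):
        print("executing", action)


def step(policy, wm, robot, instruction):
    obs = robot.observe()
    action = policy.propose(obs, instruction)
    # The world model is consulted only at high-stakes events (e.g., gripper
    # closure), not rolled out frame by frame over long horizons.
    if action.close_gripper and not wm.is_feasible(obs, action):
        video = wm.synthesize_success_video(obs)
        for corrective in wm.decode_actions(video):
            robot.execute(corrective)  # guide the robot toward the imagined success
        action = policy.propose(robot.observe(), instruction)  # re-propose from corrected state
    robot.execute(action)


step(VLAPolicy(), CompositeWorldModel(), Robot(), "pick up the red block")
```

In this reading, the discriminative check acts as a cheap gate, and the expensive generative rollout plus IDM decoding run only when that gate rejects the proposed action, which is what makes the plug-in computationally efficient.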
Primary Area: applications to robotics, autonomy, planning
Submission Number: 5405