Keywords: vision-language model, robotics, multi-modal, reward model, real-world RL
TL;DR: A Vision-Language-Action-Critic model that unifies action generation with pairwise task-progress prediction, providing dense rewards for real-world RL.
Abstract: Recent advances in Vision-Language-Action (VLA) models have significantly improved robotic perception and manipulation capabilities. However, robots deployed in real-world settings still struggle to adapt to dynamic, open-ended environments due to a lack of reliable task-progress feedback and improvement mechanisms. To address these challenges, we propose a generalist Vision-Language-Action-Critic model, VLAC, which integrates both human and robot data and unifies action generation and task-progress understanding within a single autoregressive architecture. Specifically, we propose a scalable and generalizable pairwise progress-understanding approach that predicts the task-progress delta between any two images in a visual trajectory and generates the action conditioned on the first image. The model is trained on large-scale, multi-source human data without action annotations and robot data with action annotations, together with general vision-language data that provides world-knowledge grounding. Furthermore, we deploy reinforcement learning in which VLAC autonomously evaluates task progress and feeds back intrinsic rewards. We evaluate our model's progress understanding across eight datasets and show that it not only generalizes to new tasks and environments but also discriminates successful from failed trajectories; e.g., on the RoboFAC dataset it reaches a VOC-F1 of 0.89 on successful versus 0.44 on failed trajectories, providing dependable dense reward signals. We then evaluate action generation and real-world reinforcement learning performance on diverse real-world robotic manipulation tasks. Experimental results show that VLAC's action generation is robust to disturbances, and that integrating pairwise progress prediction as a dense reward enables real-world RL to improve the success rate from roughly 30\% to 90\% within 200 episodes.
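To make the dense-reward mechanism described in the abstract concrete, here is a minimal sketch (our illustration, not the authors' released code): the critic scores the progress delta between consecutive frames of an episode, and that delta is used directly as the per-step reward. The function name `predict_progress_delta` and its output range are assumptions.

```python
# Minimal sketch of pairwise-progress dense rewards (illustrative only).
# `predict_progress_delta` is a hypothetical stand-in for VLAC's progress head:
# it maps a pair of images to the predicted change in task progress.
from typing import Callable, List, Any

def dense_rewards(
    frames: List[Any],
    predict_progress_delta: Callable[[Any, Any], float],  # assumed: (img_a, img_b) -> float
) -> List[float]:
    """Reward each step by the predicted progress made since the previous frame."""
    return [
        predict_progress_delta(frames[t], frames[t + 1])
        for t in range(len(frames) - 1)
    ]
```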
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 8496