Vision-Language-Action Pretraining from Large-Scale Human Videos

04 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: vision-language-action models, robotics, learning from video
Abstract: Existing Vision-Language-Action (VLA) models struggle with complex manipulation tasks that require high dexterity and generalization, primarily because they rely on synthetic data with significant sim-to-real gaps or on limited teleoperated demonstrations. To address this bottleneck, we propose using human hands as a "manipulator template", capitalizing on the rich dexterity and scalability of web videos of human manipulation. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, perspective spatial alignment for reasoning in a unified physical space, and post-training adaptation in physical environments. In addition, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy, modeling precise hand trajectories for action learning. To support our paradigm, we develop a comprehensive data curation pipeline that integrates heterogeneous sources --- including motion capture, VR, and RGB-only videos --- into a large-scale dataset with millions of motion-based instructional instances. We empirically show that our model excels at hand motion generation and instruction following, and that it scales well with model and data size. Importantly, we observe the expected gains in robotic dexterous manipulation once physical instruction tuning is applied.
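The abstract does not specify how part-level motion tokenization is implemented, so the following is only a minimal, hypothetical sketch of the general idea: hand motion is split into per-part trajectories (e.g., wrist plus fingers), each encoded and vector-quantized against its own codebook to produce discrete motion tokens, then decoded back to continuous trajectories. All part groupings, dimensions, and hyperparameters below are illustrative assumptions, not the authors' design.

```python
# Hypothetical part-level motion tokenizer sketch (PyTorch); not the paper's method.
import torch
import torch.nn as nn


class PartMotionTokenizer(nn.Module):
    def __init__(self, part_dims, codebook_size=512, latent_dim=64):
        super().__init__()
        # One encoder, decoder, and codebook per hand part (assumed grouping).
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 128), nn.GELU(), nn.Linear(128, latent_dim))
            for d in part_dims
        )
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 128), nn.GELU(), nn.Linear(128, d))
            for d in part_dims
        )
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, latent_dim) for _ in part_dims
        )

    def quantize(self, z, codebook):
        # Nearest-neighbour codebook lookup with a straight-through estimator.
        dist = torch.cdist(z, codebook.weight)   # (B*T, K) distances
        idx = dist.argmin(dim=-1)                # discrete motion tokens
        z_q = codebook(idx)
        z_q = z + (z_q - z).detach()             # pass gradients to the encoder
        return z_q, idx

    def forward(self, parts):
        # parts: list of (B, T, part_dim) trajectories, one tensor per part.
        recons, tokens = [], []
        for x, enc, dec, cb in zip(parts, self.encoders, self.decoders, self.codebooks):
            B, T, D = x.shape
            z = enc(x.reshape(B * T, D))
            z_q, idx = self.quantize(z, cb)
            recons.append(dec(z_q).reshape(B, T, D))
            tokens.append(idx.reshape(B, T))
        return recons, tokens


# Toy usage: a 6-D wrist pose plus five 4-DoF fingers over 16-step trajectories.
if __name__ == "__main__":
    part_dims = [6, 4, 4, 4, 4, 4]
    tok = PartMotionTokenizer(part_dims)
    parts = [torch.randn(2, 16, d) for d in part_dims]
    recons, tokens = tok(parts)
    print(tokens[0].shape)  # (2, 16) discrete tokens for the wrist part
```

Per-part codebooks are one plausible way to keep token vocabularies small while letting each part's tokens specialize; a reconstruction loss between `parts` and `recons` (plus a commitment term) would be the usual training objective for such a tokenizer.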
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 1911