Vision-Language-Action Pretraining from Large-Scale Human Videos

04 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: vision-language-action models, robotics, learning from video
Abstract: Existing Vision-Language-Action (VLA) models struggle with complex manipulation tasks that require high dexterity and generalization, primarily because they rely on synthetic data with significant sim-to-real gaps or on limited teleoperated demonstrations. To address this bottleneck, we propose using human hands as a "manipulator template", capitalizing on the rich dexterity and scalability of web videos of human manipulation. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, perspective spatial alignment for reasoning in a unified physical space, and post-training adaptation in physical environments. In addition, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy, modeling precise hand trajectories for action learning. To support our paradigm, we develop a comprehensive data curation pipeline that integrates heterogeneous sources --- including motion capture, VR, and RGB-only videos --- into a large-scale dataset with millions of motion-based instructional instances. We empirically show that our model excels at hand motion generation and instruction following, and that it scales well with model and data size. Importantly, we observe the expected gains in robotic dexterous manipulation once physical instruction tuning is applied.
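The abstract does not specify how part-level motion tokenization is implemented, so the following is only a minimal, hypothetical sketch of the general idea: hand motion is split into per-part trajectories (e.g., wrist plus fingers), each encoded and vector-quantized against its own codebook to produce discrete motion tokens, then decoded back to continuous trajectories. All part groupings, dimensions, and hyperparameters below are illustrative assumptions, not the authors' design.

```python
# Hypothetical part-level motion tokenizer sketch (PyTorch); not the paper's method.
import torch
import torch.nn as nn


class PartMotionTokenizer(nn.Module):
    def __init__(self, part_dims, codebook_size=512, latent_dim=64):
        super().__init__()
        # One encoder, decoder, and codebook per hand part (assumed grouping).
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 128), nn.GELU(), nn.Linear(128, latent_dim))
            for d in part_dims
        )
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 128), nn.GELU(), nn.Linear(128, d))
            for d in part_dims
        )
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, latent_dim) for _ in part_dims
        )

    def quantize(self, z, codebook):
        # Nearest-neighbour codebook lookup with a straight-through estimator.
        dist = torch.cdist(z, codebook.weight)   # (B*T, K) distances
        idx = dist.argmin(dim=-1)                # discrete motion tokens
        z_q = codebook(idx)
        z_q = z + (z_q - z).detach()             # pass gradients to the encoder
        return z_q, idx

    def forward(self, parts):
        # parts: list of (B, T, part_dim) trajectories, one tensor per part.
        recons, tokens = [], []
        for x, enc, dec, cb in zip(parts, self.encoders, self.decoders, self.codebooks):
            B, T, D = x.shape
            z = enc(x.reshape(B * T, D))
            z_q, idx = self.quantize(z, cb)
            recons.append(dec(z_q).reshape(B, T, D))
            tokens.append(idx.reshape(B, T))
        return recons, tokens


# Toy usage: a 6-D wrist pose plus five 4-DoF fingers over 16-step trajectories.
if __name__ == "__main__":
    part_dims = [6, 4, 4, 4, 4, 4]
    tok = PartMotionTokenizer(part_dims)
    parts = [torch.randn(2, 16, d) for d in part_dims]
    recons, tokens = tok(parts)
    print(tokens[0].shape)  # (2, 16) discrete tokens for the wrist part
```

Per-part codebooks are one plausible way to keep token vocabularies small while letting each part's tokens specialize; a reconstruction loss between `parts` and `recons` (plus a commitment term) would be the usual training objective for such a tokenizer.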
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 1911