VIP: Vision Instructed Pre-training for Robotic Manipulation

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We show how to utilize vision instruction to improve robotic manipulation pre-training.
Abstract: The effectiveness of scaling up training data in robotic manipulation is still limited. A primary challenge in manipulation is that the tasks are diverse, and a trained policy becomes confused if the task targets are not specified clearly. Existing works primarily rely on text instruction to describe targets. However, we reveal that current robotic data cannot train policies to understand text instruction effectively, and vision is much more comprehensible. Therefore, we introduce utilizing vision instruction to specify targets. A straightforward implementation is to train a policy to predict the intermediate actions linking the current observation and a future image. Nevertheless, a single future image describes the task target in insufficient detail. To handle this problem, we propose using sparse point flows to provide more detailed information. Extensive tasks are designed based on real and simulated environments to evaluate the effectiveness of our vision instructed pre-training (VIP) method. The results indicate that VIP improves performance on diverse tasks significantly, and the derived policy can complete competitive tasks like ``opening the lid of a tightly sealed bottle''.
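As a rough illustration of the conditioning described above, the sketch below (PyTorch; all module and variable names are hypothetical and not taken from the released VIRT code) shows a policy that consumes the current observation, a future goal image, and sparse point flows, and predicts the chunk of intermediate actions. The point flows trace how a sparse set of keypoints moves over the action horizon, giving the policy denser target information than the single goal image alone.

```python
# Minimal sketch of vision-instructed conditioning. Hypothetical names and
# architecture; see https://github.com/Lizhuoling/VIRT for the authors' method.
import torch
import torch.nn as nn

class VisionInstructedPolicy(nn.Module):
    """Predicts an action chunk from the current observation, a future
    (goal) image, and sparse point flows tracing keypoint motion."""
    def __init__(self, action_dim=7, chunk_len=16, n_points=32, d_model=256):
        super().__init__()
        # Shared image encoder for the current and goal frames
        # (assumption: any visual backbone could fill this role).
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(32, d_model, kernel_size=4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Sparse point flow: (x, y) per keypoint per step, flattened.
        self.flow_enc = nn.Linear(n_points * chunk_len * 2, d_model)
        self.head = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, obs, goal, flows):
        # Fuse current view, goal view, and flow features, then decode
        # the intermediate actions linking the two observations.
        z = torch.cat([self.img_enc(obs), self.img_enc(goal),
                       self.flow_enc(flows.flatten(1))], dim=-1)
        return self.head(z).view(-1, self.chunk_len, self.action_dim)

policy = VisionInstructedPolicy()
obs = torch.randn(2, 3, 128, 128)   # current camera frame
goal = torch.randn(2, 3, 128, 128)  # future image specifying the target
flows = torch.randn(2, 32, 16, 2)   # sparse point flows (points, steps, xy)
actions = policy(obs, goal, flows)  # -> (2, 16, 7) intermediate actions
```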
Lay Summary: This work points out that vision instruction is more comprehensible than text instruction for current embodied policies and develops a novel manipulation pre-training paradigm based on sparse point flow.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Lizhuoling/VIRT
Primary Area: Applications->Robotics
Keywords: vision instruction, robotic manipulation pre-training, imitation learning
Submission Number: 1146