H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos

Published: 06 May 2025, Last Modified: 06 May 2025, SynData4CV, License: CC BY 4.0
Keywords: Robot Learning: Model Learning, Robot Perception, Sensing & Vision, Robot Learning: Imitation Learning
Abstract: Large-scale pre-training using videos has proven effective for robot learning, as it enables the model to acquire task knowledge from first-person human operation data that reveals how humans perform tasks and interact with their environment. However, models pre-trained on such data can be suboptimal for robot learning due to the significant visual gap between human hands and the arms of different robots. To remedy this, we propose H2R, a simple data augmentation technique for robot pre-training from videos, which extracts human hands from first-person videos and replaces them with different robot arms to generate new video data for pre-training. Specifically, we first detect the 3D positions and keypoints of human hands. This detected information then serves as the basis for generating robot arms in a simulation environment with matching motion postures. Next, we calibrate the intrinsic parameters of the simulator camera to those of the first-person human camera. Finally, we overlay the rendered images of the robotic arm's motion states from the simulator onto the corresponding images of human hand motions. H2R thus bridges the human-hand-to-robot-arm conversion at the visual level. We conduct extensive experiments on a range of robotic tasks, from standard simulation benchmarks to real-world robotic tasks. The experimental results show that H2R can improve the representation capability of visual encoders pre-trained with various methods, whether imitation learning or reinforcement learning is used as the paradigm for downstream robot policy learning.
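To make the final overlay step concrete, below is a minimal sketch of how a rendered robot arm could be composited over the detected human-hand region in an egocentric frame. It assumes a hand detector has already produced a binary hand mask and the simulator has returned an RGBA render of the robot arm from a camera calibrated to match the video camera; the function name `replace_hand_with_robot` is illustrative and not taken from the paper.

```python
import cv2
import numpy as np


def replace_hand_with_robot(frame, hand_mask, robot_rgba):
    """Composite a simulator-rendered robot arm over the human-hand region.

    frame      : HxWx3 uint8 egocentric video frame
    hand_mask  : HxW  uint8 mask (255 where the human hand is visible)
    robot_rgba : HxWx4 uint8 render of the robot arm, produced with camera
                 intrinsics/extrinsics matched to the video camera
    """
    # 1) Inpaint the human hand so skin pixels do not leak around the robot arm.
    inpainted = cv2.inpaint(frame, hand_mask, inpaintRadius=3,
                            flags=cv2.INPAINT_TELEA)

    # 2) Alpha-blend the rendered robot arm onto the inpainted frame.
    rgb = robot_rgba[..., :3].astype(np.float32)
    alpha = robot_rgba[..., 3:4].astype(np.float32) / 255.0
    out = alpha * rgb + (1.0 - alpha) * inpainted.astype(np.float32)
    return out.astype(np.uint8)
```

In practice the hand mask and the robot render would come from the hand-detection and simulation stages described above; this sketch only shows the per-frame compositing that produces the augmented pre-training video.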
Supplementary Material: zip
Submission Number: 76
