Tool-as-Interface: Learning Robot Policies from Observing Human Tool Use

Published: 18 Apr 2025, Last Modified: 07 May 2025 · ICRA 2025 FMNS Oral · Best Paper Candidate · CC BY 4.0
Keywords: Manipulation, Imitation Learning, Foundation Models
Abstract: While teleoperation is widely used, it is slow, delay-sensitive, and poorly suited for dynamic tasks. In contrast, human demonstrations provide natural, hardware-free data collection, though they are difficult to leverage due to viewpoint variation and embodiment mismatch. We propose a framework that transfers tool-use knowledge from humans to robots. Using two RGB cameras, we reconstruct 3D scenes and apply Gaussian splatting for novel view synthesis, improving robustness to camera-viewpoint variation. We reduce the embodiment gap by using segmented observations to enable human-to-robot transfer, and we use tool-centric, task-space actions to achieve base-invariant visuomotor policy learning. Our method achieves a 71% improvement in task success and a 77% reduction in data collection time compared to diffusion policies trained on teleoperation data with equivalent time budgets. Our method also outperforms the UMI gripper, reducing collection time by 41%.
Submission Number: 22
