Workshop Statement: Our work, "Tool-as-Interface: Learning Robot Tool Use from Human Play through Imitation Learning," contributes to the broader goal of human-centered robot learning by leveraging the natural, intuitive behavior of humans interacting with tools to teach robots complex manipulation tasks. Unlike traditional teleoperation-based methods, which often require specialized hardware, extensive expertise, and precise control, our framework taps into human play—spontaneous, accessible interactions with everyday tools—as a scalable and cost-effective data source for robot learning. This approach centers human behavior as the interface for skill acquisition, enabling more inclusive participation in data collection and reducing the barrier to entry for training general-purpose robot policies.
From a technical standpoint, our framework addresses two key challenges in human-centered learning: cross-embodiment and cross-viewpoint generalization. By abstracting actions into tool-centric, task-space representations and removing embodiment-specific information via visual segmentation, we enable policies to transfer from humans to robots with vastly different morphologies. Furthermore, we use two-view 3D reconstruction and Gaussian splatting to augment human play data with novel viewpoints, promoting view-invariant policy learning and robust performance under camera and robot-base perturbations. The result is a system that empowers robots to learn precise, contact-rich, and dynamic skills, such as pan flipping or wine-bottle balancing, directly from raw human demonstrations, without requiring teleoperation or simulation. In doing so, our method promotes a more human-centric, accessible, and scalable paradigm for robot learning.
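To illustrate the action and observation abstractions described above, the following is a minimal sketch, assuming tool poses are available as 4x4 homogeneous transforms and that embodiment masks come from an off-the-shelf segmenter; the function names (relative_tool_action, mask_embodiment) and the zero-fill masking are hypothetical and not the paper's exact implementation.

    import numpy as np

    def relative_tool_action(T_tool_now, T_tool_next):
        # Express the next tool pose in the current tool frame, so the same
        # action applies whether a human hand or a robot gripper holds the tool.
        return np.linalg.inv(T_tool_now) @ T_tool_next

    def mask_embodiment(rgb, embodiment_mask):
        # Zero out pixels belonging to the human arm/hand (or robot arm) so the
        # policy only sees the tool and the scene. Zero-filling is an assumed
        # choice here; inpainting or another fill strategy could be used instead.
        out = rgb.copy()
        out[embodiment_mask.astype(bool)] = 0
        return out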
Keywords: Tool Use, Data Collection
Abstract: Tool use is essential for enabling robots to perform complex real-world tasks, but learning such skills requires extensive datasets. While teleoperation is widely used, it is slow, delay-sensitive, and poorly suited for dynamic tasks. In contrast, human videos provide a natural way to collect data without specialized hardware, though they pose challenges for robot learning due to viewpoint variations and embodiment gaps. To address these challenges, we propose a framework that transfers tool-use knowledge from humans to robots. To improve the policy's robustness to viewpoint variations, we use two RGB cameras to reconstruct 3D scenes and apply Gaussian splatting for novel view synthesis. We reduce the embodiment gap using segmented observations and tool-centric, task-space actions to achieve embodiment-invariant visuomotor policy learning. Our method achieves a 71% improvement in task success and a 77% reduction in data collection time compared to diffusion policies trained on teleoperation data with equivalent time budgets. It also reduces data collection time by 41% compared with the state-of-the-art data collection interface.
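As a rough sketch of the viewpoint-augmentation idea, the code below samples camera poses near a calibrated view; each sampled pose would then be handed to a novel-view renderer (for instance, a Gaussian-splatting reconstruction of the scene) to synthesize additional training images. The function name, perturbation ranges, and Rodrigues-based rotation sampling are illustrative assumptions rather than the exact procedure used in the paper.

    import numpy as np

    def sample_nearby_camera_poses(T_world_cam, n_views=8,
                                   max_angle_deg=10.0, max_trans_m=0.05, seed=0):
        # Sample small SE(3) perturbations of a calibrated camera pose (4x4 matrix).
        rng = np.random.default_rng(seed)
        poses = []
        for _ in range(n_views):
            axis = rng.normal(size=3)
            axis /= np.linalg.norm(axis)
            angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
            K = np.array([[0.0, -axis[2], axis[1]],
                          [axis[2], 0.0, -axis[0]],
                          [-axis[1], axis[0], 0.0]])
            # Rodrigues' formula for the rotation about the sampled axis.
            R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
            delta = np.eye(4)
            delta[:3, :3] = R
            delta[:3, 3] = rng.uniform(-max_trans_m, max_trans_m, size=3)
            poses.append(T_world_cam @ delta)
        return poses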
Submission Number: 9