Workshop Statement: Our work, "Tool-as-Interface: Learning Robot Tool Use from Human Play through Imitation Learning," contributes to the broader goal of human-centered robot learning by leveraging the natural, intuitive behavior of humans interacting with tools to teach robots complex manipulation tasks. Unlike traditional teleoperation-based methods, which often require specialized hardware, extensive expertise, and precise control, our framework taps into human play—spontaneous, accessible interactions with everyday tools—as a scalable and cost-effective data source for robot learning. This approach centers human behavior as the interface for skill acquisition, enabling more inclusive participation in data collection and reducing the barrier to entry for training general-purpose robot policies.
From a technical standpoint, our framework bridges critical gaps in cross-embodiment and cross-viewpoint generalization, key challenges in human-centered learning. By abstracting actions into tool-centric, task-space representations and removing embodiment-specific information via visual segmentation, we enable policies to transfer from humans to robots with vastly different morphologies. Furthermore, we use two-view 3D reconstruction and Gaussian splatting to augment human play data with novel viewpoints, promoting view-invariant policy learning and robust performance under camera and robot base perturbations. The result is a system that empowers robots to learn precise, contact-rich, and dynamic skills—such as pan flipping or wine bottle balancing—directly from raw human demonstrations, without requiring teleoperation or simulation. In doing so, our method promotes a more human-centric, accessible, and scalable paradigm for robot learning.
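To make the tool-centric abstraction concrete, the sketch below is a minimal illustration (our own simplification, not the submission's released code) of turning a trajectory of estimated 6-DoF tool poses into relative, embodiment-agnostic task-space actions and zeroing out hand/arm pixels with a segmentation mask; the pose estimates and the mask are hypothetical placeholders standing in for the reconstruction and segmentation stages.

```python
# Sketch: tool-centric, embodiment-agnostic action extraction (illustrative only).
# Assumes per-frame 6-DoF tool poses (4x4 world-frame transforms) have already been
# estimated, and that a binary hand/arm mask comes from an off-the-shelf segmenter.
import numpy as np

def relative_tool_actions(tool_poses: np.ndarray) -> np.ndarray:
    """Convert absolute tool poses (N, 4, 4) into relative task-space actions.

    Each action is the transform from the tool pose at step t to step t+1,
    expressed in the tool frame, so it does not depend on whether a human hand
    or a robot gripper is holding the tool.
    """
    actions = []
    for t in range(len(tool_poses) - 1):
        T_t, T_next = tool_poses[t], tool_poses[t + 1]
        actions.append(np.linalg.inv(T_t) @ T_next)  # tool-frame delta pose
    return np.stack(actions)

def mask_out_embodiment(rgb: np.ndarray, embodiment_mask: np.ndarray) -> np.ndarray:
    """Zero out pixels belonging to the hand/arm so observations are embodiment-agnostic."""
    out = rgb.copy()
    out[embodiment_mask.astype(bool)] = 0
    return out

# Hypothetical usage with dummy data standing in for real estimates:
poses = np.tile(np.eye(4), (10, 1, 1))        # placeholder tool-pose trajectory
poses[:, 0, 3] = np.linspace(0.0, 0.1, 10)    # 10 cm translation along x
image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
hand_mask = np.zeros((480, 640), dtype=bool)  # placeholder segmentation mask
deltas = relative_tool_actions(poses)
clean_obs = mask_out_embodiment(image, hand_mask)
print(deltas.shape, clean_obs.shape)          # (9, 4, 4) (480, 640, 3)
```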
Keywords: Manipulation, Imitation Learning
Abstract: Tool use is critical for enabling robots to perform complex real-world tasks, and leveraging human tool-use data can be instrumental for teaching robots. However, existing data collection methods like teleoperation are slow, prone to control delays, and unsuitable for dynamic tasks. In contrast, human play—where humans directly perform tasks with tools—offers natural, unstructured interactions that are both efficient and easy to collect.
Building on the insight that humans and robots can share the same tools, we propose a framework to transfer tool-use knowledge from human play to robots. Using two RGB cameras, our method generates a 3D reconstruction, applies Gaussian splatting for novel view augmentation, employs segmentation models to extract embodiment-agnostic observations, and leverages task-space tool-action representations to train visuomotor policies.
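As a rough illustration of the two-camera reconstruction step, the sketch below recovers a 3D tool keypoint from two calibrated views via linear triangulation (direct linear transform); the projection matrices and pixel detections are hypothetical placeholders, and the Gaussian-splatting and policy-training stages are not shown.

```python
# Sketch: two-view linear triangulation (DLT) of a tool keypoint, illustrating the
# kind of 3D reconstruction built from two RGB cameras. Projection matrices and
# pixel coordinates below are hypothetical placeholders.
import numpy as np

def triangulate(P1: np.ndarray, P2: np.ndarray, uv1: np.ndarray, uv2: np.ndarray) -> np.ndarray:
    """Recover a 3D point from its pixel coordinates in two calibrated views.

    P1, P2: 3x4 camera projection matrices; uv1, uv2: (u, v) pixel coordinates.
    Solves the homogeneous system A X = 0 via SVD.
    """
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Hypothetical calibrated setup: identity intrinsics, second camera offset 0.2 m along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
point = np.array([0.05, 0.02, 1.0])                       # ground-truth keypoint
h = np.append(point, 1.0)
uv1 = (P1 @ h)[:2] / (P1 @ h)[2]
uv2 = (P2 @ h)[:2] / (P2 @ h)[2]
print(triangulate(P1, P2, uv1, uv2))                      # ~[0.05, 0.02, 1.0]
```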
We validate our approach on diverse real-world tasks, including meatball scooping, pan flipping, wine bottle balancing, and other complex tasks. Our method achieves a 71% higher average success rate than diffusion policies trained on teleoperation data and reduces data collection time by 77%, with some tasks solvable only by our framework. Compared to UMI, a hand-held gripper interface, our method cuts data collection time by 41%. Additionally, our method bridges the embodiment gap, improves robustness to variations in camera viewpoints and robot configurations, and generalizes effectively across objects and spatial setups.
Supplementary Material: zip
Submission Number: 9