Keywords: Robotics, UMI, Tactile Sensing, Contact-Rich Manipulation, Diffusion Policy, Imitation Learning, Behavior Cloning
TL;DR: PolyUMI is a wireless, UMI-based handheld gripper that records synchronized tactile video, contact audio, wrist camera video, and pose data for robotic imitation learning.
Abstract: Contact-rich manipulation remains a fundamental challenge in robot learning, in part because contact events are brief, highly variable, and not fully captured by vision alone. We present \textbf{PolyUMI}, a real-time multimodal data collection and control platform that unifies four sensing modalities in a single end-effector: optical tactile sensing, mechanical vibration (via contact microphone), egocentric vision, and proprioception. Building on the Universal Manipulation Interface (UMI) handheld gripper framework, PolyUMI adds a custom touch-sensing finger inspired by PolyTouch, delivering synchronized streams of tactile video, contact audio, wrist camera video, and pose data, all from a fully wireless, battery-powered gripper. The system also supports an end-effector for the Franka Panda arm that preserves the same sensor geometry as the handheld gripper to facilitate policy transfer. We describe the hardware, firmware, and software architecture of PolyUMI and discuss its potential as a platform for studying how tactile and auditory sensing can complement vision in learning contact-rich manipulation policies.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Video: mp4
Submission Number: 30