Camera Ready Pdf: pdf
Keywords: Dexterous Manipulation, Visuo-Tactile Fusion, Cross-Attention, Autoregressive Tactile Forecasting, Imitation Learning
TL;DR: ViTacFormer fuses vision and touch with cross-attention and autoregressive tactile prediction, improving success rates by about 50% over strong baselines, and is the first to complete an 11-stage dexterous task, with 2.5 minutes of continuous operation.
Abstract: Dexterous manipulation is crucial for robots to interact with the physical world. While vision-based methods have advanced rapidly, tactile sensing remains essential for fine-grained control, especially under occlusion. We present ViTacFormer, a cross-modal framework that fuses vision and touch via cross-attention and predicts future tactile states with an autoregressive head. During training, a curriculum gradually shifts the tactile inputs from ground truth to the model's own predictions, stabilizing representation learning. On real-world benchmarks covering both short- and long-horizon tasks, ViTacFormer improves success rates by about 50% over strong baselines and is the first to complete an 11-stage dexterous manipulation task, with 2.5 minutes of continuous operation.
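
The curriculum mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the schedule, so everything here is an assumption made for illustration: the linear ramp, the per-time-step sampling, and the hypothetical names `predicted_ratio`, `mix_tactile`, `warmup_steps`, and `ramp_steps`. It is not the authors' implementation.

```python
import random


def predicted_ratio(step, warmup_steps, ramp_steps):
    """Fraction of tactile inputs taken from the model's own forecasts.

    Assumed schedule (not from the paper): 0 during warm-up, then a
    linear ramp up to 1 over `ramp_steps` training steps.
    """
    if ramp_steps <= 0:
        raise ValueError("ramp_steps must be positive")
    progress = (step - warmup_steps) / ramp_steps
    return min(1.0, max(0.0, progress))


def mix_tactile(ground_truth, predicted, ratio, rng):
    """For each time step, use the predicted reading with probability `ratio`,
    otherwise keep the ground-truth reading."""
    if len(ground_truth) != len(predicted):
        raise ValueError("sequences must have the same length")
    return [p if rng.random() < ratio else g
            for g, p in zip(ground_truth, predicted)]


if __name__ == "__main__":
    rng = random.Random(0)
    ground_truth = [0.10, 0.20, 0.30, 0.40]   # placeholder sensor readings
    predicted = [0.12, 0.19, 0.33, 0.38]      # placeholder model forecasts
    for step in (0, 300, 1000):
        ratio = predicted_ratio(step, warmup_steps=100, ramp_steps=400)
        print(step, ratio, mix_tactile(ground_truth, predicted, ratio, rng))
```

Early in training the policy sees only ground-truth tactile signals; by the end of the ramp it sees only its own forecasts, which matches the setting at deployment, where future tactile readings are not available.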
Supplementary Material: zip
Submission Number: 5