ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Published: 09 Sept 2025 · Last Modified: 19 Sept 2025 · CoRL 2025 RINO · CC BY 4.0
Keywords: Dexterous Manipulation, Visuo-Tactile Fusion, Cross-Attention, Autoregressive Tactile Forecasting, Imitation Learning
TL;DR: ViTacFormer fuses vision and touch via cross-attention and autoregressive tactile forecasting, improving success rates by roughly 50% and completing the first 11-stage dexterous task over 2.5 minutes of continuous operation.
Abstract: Dexterous manipulation is crucial for robots to interact with the physical world. While vision-based methods have advanced rapidly, tactile sensing remains essential for fine-grained control, especially under occlusion. We present ViTacFormer, a cross-modal framework that fuses vision and touch via cross-attention and predicts future tactile states with an autoregressive head. A curriculum gradually shifts the model's conditioning from ground-truth to predicted tactile inputs, stabilizing representation learning. On real-world benchmarks covering both short- and long-horizon tasks, ViTacFormer improves success rates by about 50% over strong baselines, and is the first to complete an 11-stage dexterous manipulation task requiring 2.5 minutes of continuous operation.
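The two mechanisms the abstract names can be sketched concisely: cross-attention in which one modality's tokens query the other's, and a scheduled-sampling-style curriculum that ramps from ground-truth to predicted tactile inputs. The sketch below is illustrative only; the shapes, the single attention head, the linear ramp, and all function names are assumptions, not the paper's actual architecture or schedule.

```python
import random
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head cross-attention: one modality attends to the other.

    Scaled dot-product scores over the key tokens, then a weighted sum
    of the value tokens. (Illustrative; the paper's fusion module is
    presumably multi-head and learned.)
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def scheduled_tactile_input(step, total_steps, ground_truth, predicted):
    """Curriculum mixing of tactile inputs (assumed linear ramp).

    Early in training the autoregressive head conditions on ground-truth
    tactile readings; by the end it conditions on its own predictions.
    """
    p_predicted = min(1.0, step / total_steps)
    return predicted if random.random() < p_predicted else ground_truth

# Vision tokens query tactile tokens (token counts and dims are made up).
rng = np.random.default_rng(0)
vision = rng.standard_normal((8, 32))    # 8 vision tokens, dim 32
tactile = rng.standard_normal((4, 32))   # 4 tactile tokens, dim 32
fused = cross_attention(vision, tactile, tactile)  # shape (8, 32)
```

Each fused vision token is a tactile-weighted summary, which is one way touch can inform control even when the contact region is visually occluded.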
Supplementary Material: zip
Submission Number: 6