ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Published: 09 Sept 2025 · Last Modified: 19 Sept 2025 · CoRL 2025 RINO · CC BY 4.0
Keywords: Dexterous Manipulation, Visuo-Tactile Fusion, Cross-Attention, Autoregressive Tactile Forecasting, Imitation Learning
TL;DR: ViTacFormer fuses vision and touch via cross-attention and autoregressive tactile forecasting, improving success rates by roughly 50% and completing the first 11-stage dexterous task over 2.5 minutes of continuous operation.
Abstract: Dexterous manipulation is crucial for robots to interact with the physical world. While vision-based methods have advanced rapidly, tactile sensing remains essential for fine-grained control, especially under occlusion. We present ViTacFormer, a cross-modal framework that fuses vision and touch via cross-attention and predicts future tactile states with an autoregressive head. A curriculum gradually shifts the model's conditioning from ground-truth to predicted tactile inputs, stabilizing representation learning. On real-world benchmarks covering both short- and long-horizon tasks, ViTacFormer improves success rates by about 50% over strong baselines, and is the first to complete an 11-stage dexterous manipulation task requiring 2.5 minutes of continuous operation.
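The two mechanisms the abstract names can be sketched concisely: cross-attention in which one modality's tokens query the other's, and a scheduled-sampling-style curriculum that ramps from ground-truth to predicted tactile inputs. The sketch below is illustrative only; the shapes, the single attention head, the linear ramp, and all function names are assumptions, not the paper's actual architecture or schedule.

```python
import random
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head cross-attention: one modality attends to the other.

    Scaled dot-product scores over the key tokens, then a weighted sum
    of the value tokens. (Illustrative; the paper's fusion module is
    presumably multi-head and learned.)
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def scheduled_tactile_input(step, total_steps, ground_truth, predicted):
    """Curriculum mixing of tactile inputs (assumed linear ramp).

    Early in training the autoregressive head conditions on ground-truth
    tactile readings; by the end it conditions on its own predictions.
    """
    p_predicted = min(1.0, step / total_steps)
    return predicted if random.random() < p_predicted else ground_truth

# Vision tokens query tactile tokens (token counts and dims are made up).
rng = np.random.default_rng(0)
vision = rng.standard_normal((8, 32))    # 8 vision tokens, dim 32
tactile = rng.standard_normal((4, 32))   # 4 tactile tokens, dim 32
fused = cross_attention(vision, tactile, tactile)  # shape (8, 32)
```

Each fused vision token is a tactile-weighted summary, which is one way touch can inform control even when the contact region is visually occluded.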
Supplementary Material: zip
Submission Number: 6