ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

TMLR Paper6113 Authors

06 Oct 2025 (modified: 10 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Tactile sensing provides essential local information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-scale spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based framework that robustly integrates visual and tactile inputs to learn task-agnostic representations for visuotactile perception. Our key idea is to encode positional structure at two complementary levels that emerge naturally in visuotactile perception: local, within each modality, and global, shared across modalities to place their tokens in a common reference frame before fusion. Unlike prior work, we provide provable guarantees for visuotactile fusion, showing that our encodings are injective, translation-equivariant, and information-preserving, and we validate these properties empirically. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success.
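To make the two-level positional-encoding idea concrete, here is a minimal illustrative sketch (not the authors' implementation): per-modality local encodings are added to visual and tactile tokens separately, a shared global encoding places the concatenated tokens in a common reference frame, and a standard transformer encoder performs fusion. All names, token counts, dimensions, and the additive combination are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class VisuotactilePESketch(nn.Module):
    """Illustrative sketch of local + global positional encodings before fusion.

    This is NOT the ViTaPEs implementation; shapes and the learnable additive
    encodings are assumed purely to demonstrate the two-level structure.
    """

    def __init__(self, dim=256, n_vis=196, n_tac=64, n_heads=8, depth=4):
        super().__init__()
        # Local positional encodings: a separate table per modality,
        # capturing within-modality spatial structure.
        self.local_pe_vis = nn.Parameter(torch.zeros(1, n_vis, dim))
        self.local_pe_tac = nn.Parameter(torch.zeros(1, n_tac, dim))
        # Global positional encoding: a single table shared across modalities,
        # placing all tokens in a common reference frame before fusion.
        self.global_pe = nn.Parameter(torch.zeros(1, n_vis + n_tac, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, vis_tokens, tac_tokens):
        # vis_tokens: (B, n_vis, dim); tac_tokens: (B, n_tac, dim)
        vis = vis_tokens + self.local_pe_vis      # local, within-modality
        tac = tac_tokens + self.local_pe_tac
        tokens = torch.cat([vis, tac], dim=1) + self.global_pe  # global, cross-modal
        return self.fusion(tokens)


# Usage with random patch/taxel embeddings:
model = VisuotactilePESketch()
out = model(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
print(out.shape)  # torch.Size([2, 260, 256])
```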
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Evan_G_Shelhamer1
Submission Number: 6113