ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations

30 Aug 2025 (modified: 01 Sept 2025) · IEEE IROS 2025 Workshop Tactile Sensing Submission · CC BY 4.0
Keywords: Visual-Tactile Fusion, Contrastive Learning
Abstract: We propose ConViTac, a visual-tactile representation learning network that improves feature alignment during fusion using contrastive representations. Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism, which leverages a contrastive encoder, pretrained through self-supervised contrastive learning, to project visual and tactile inputs into unified latent embeddings. These embeddings couple visual-tactile feature fusion through cross-modal attention, aligning the unified representations and improving performance on downstream tasks. Extensive real-world experiments demonstrate that ConViTac outperforms current state-of-the-art methods and confirm the effectiveness of the proposed CEC mechanism, which improves accuracy by up to 12.0% on material classification and grasping prediction tasks.
Submission Number: 16
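To make the abstract's description of Contrastive Embedding Conditioning more concrete, below is a minimal PyTorch sketch of the general idea: a frozen, contrastively pretrained encoder maps visual and tactile features into a shared latent space, and those embeddings condition cross-modal attention during fusion. This is an assumption-laden illustration, not the authors' implementation; all module names, dimensions, and the linear placeholder encoder are hypothetical.

```python
import torch
import torch.nn as nn


class ContrastiveEmbeddingConditioning(nn.Module):
    """Illustrative sketch of a CEC-style fusion block (hypothetical).

    A contrastive encoder, assumed to be pretrained via self-supervised
    contrastive learning and kept frozen, projects each modality into a
    unified latent space. Cross-modal attention then conditions each
    modality's features on the other's contrastive embedding before fusion.
    """

    def __init__(self, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Placeholder for the pretrained contrastive encoder; in the paper
        # this would be learned beforehand and frozen during fusion training.
        self.contrastive_encoder = nn.Linear(feat_dim, feat_dim)
        for p in self.contrastive_encoder.parameters():
            p.requires_grad = False
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, vis_feat: torch.Tensor, tac_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat, tac_feat: (batch, tokens, feat_dim) from modality backbones.
        # Project both modalities into the shared contrastive latent space.
        z_vis = self.contrastive_encoder(vis_feat)
        z_tac = self.contrastive_encoder(tac_feat)
        # Condition each modality on the other's contrastive embedding
        # through cross-modal attention.
        vis_cond, _ = self.cross_attn(query=vis_feat, key=z_tac, value=z_tac)
        tac_cond, _ = self.cross_attn(query=tac_feat, key=z_vis, value=z_vis)
        # Fuse the conditioned features for downstream task heads
        # (e.g., material classification or grasping prediction).
        return self.fuse(torch.cat([vis_cond, tac_cond], dim=-1))


# Usage example with random stand-in features:
cec = ContrastiveEmbeddingConditioning()
v = torch.randn(2, 16, 256)   # visual tokens
t = torch.randn(2, 16, 256)   # tactile tokens
fused = cec(v, t)             # -> (2, 16, 256)
```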