Aligning Touch, Vision, and Language for Multimodal Perception

Published: 26 Oct 2024, Last Modified: 03 Dec 2024 · WTP · CC BY 4.0
Keywords: multimodal learning, dataset
Abstract: Touch, a crucial human sensing modality, has been absent from multimodal generative language models due to the difficulty of labeling tactile data. This work addresses the gap by collecting tactile and visual data simultaneously, which allows GPT-4V to generate pseudo-labels from the visual observations alone. The resulting dataset comprises 44K vision-touch pairs with English labels (10% human-annotated, 90% GPT-4V pseudo-labels). A touch-vision-language (TVL) model trained on this dataset shows improved touch-vision-language alignment (+29% classification accuracy) over existing models and outperforms GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark.
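The abstract describes aligning touch with vision and language representations but does not spell out the training objective. As an illustration only, a CLIP-style pairwise contrastive (InfoNCE) alignment over touch, vision, and text embeddings might look like the following sketch; the function names, embedding shapes, and loss composition are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of CLIP-style pairwise contrastive alignment across
# touch, vision, and language embeddings. Encoder choices, embedding
# dimension, temperature, and names are illustrative assumptions, not the
# paper's code.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings of shape (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def tvl_alignment_loss(touch_emb: torch.Tensor,
                       vision_emb: torch.Tensor,
                       text_emb: torch.Tensor) -> torch.Tensor:
    """Align touch embeddings to paired vision embeddings and to (pseudo-)label text embeddings."""
    return info_nce(touch_emb, vision_emb) + info_nce(touch_emb, text_emb)
```

A common design choice for this kind of alignment is to keep pretrained vision and text encoders frozen and train only the touch encoder, though the abstract does not state which components are trained.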
Supplementary Material: pdf
Spotlight Video: zip
Submission Number: 6