- Abstract: Visual grounding of language is an active research field aiming at enriching text-based representations with visual information. In this paper, we propose a new way to leverage visual knowledge for sentence representations. Our approach transfers the structure of a visual representation space to the textual space by using two complementary sources of information: (1) the cluster information: the implicit knowledge that two sentences associated with the same visual content describe the same underlying reality and (2) the perceptual information contained within the structure of the visual space. We use a joint approach to encourage beneficial interactions during training between textual, perceptual, and cluster information. We demonstrate the quality of the learned representations on semantic relatedness, classification, and cross-modal retrieval tasks.
- Keywords: multimodal, sentence, representation, embedding, grounding
- TL;DR: We propose a joint model to incorporate visual knowledge in sentence representations