Can Decoupling Embedded Text from Images Improve Multimodal Learning?

Published: 19 Mar 2024, Last Modified: 11 May 2024
Venue: Tiny Papers @ ICLR 2024
License: CC BY 4.0
Keywords: Multimodal Learning, Vision Language Models, Foundation Models, Social Computing, Hate Speech, Computational Social Science
TL;DR: We removed the text from text-embedded images such as memes and investigated the effect of this intervention on the learned representations of CLIP.
Abstract: Multimodal models are widely used to process text-embedded images on social media. However, the effect of the embedded text on the image encoding process remains unexplored. In this work, we removed the text from text-embedded images and compared the intervention's effect on the performance of unimodal and multimodal models. We find that the image encoders of multimodal models exploit linguistic information in the pixel space to a considerable degree. Further, we observe that disentangling text and images can improve multimodal learning under certain circumstances.
Supplementary Material: zip
Submission Number: 148
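
As an illustration of the intervention described in the abstract, below is a minimal sketch of one way to remove embedded text from an image and compare CLIP image embeddings before and after. The specific tools (easyocr for text detection, OpenCV inpainting for removal, the Hugging Face `openai/clip-vit-base-patch32` checkpoint) and the file name `meme.png` are assumptions for demonstration, not the paper's actual pipeline.

```python
# Sketch only: OCR-based text removal + CLIP embedding comparison.
# The paper's exact text-removal method and models may differ.
import cv2
import easyocr
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# OCR reader for detecting embedded text regions (assumed English).
reader = easyocr.Reader(["en"])

def remove_embedded_text(image_path: str) -> np.ndarray:
    """Detect embedded text with OCR and inpaint the detected regions."""
    image = cv2.imread(image_path)
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for bbox, _text, _conf in reader.readtext(image):
        # Fill each detected text bounding polygon into the inpainting mask.
        pts = np.array(bbox, dtype=np.int32)
        cv2.fillPoly(mask, [pts], 255)
    # Inpaint the masked text regions from surrounding pixels.
    return cv2.inpaint(image, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_embedding(image: Image.Image) -> torch.Tensor:
    """Encode an image with CLIP's image encoder."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)

# Compare embeddings of the original and text-free versions of an image.
original = Image.open("meme.png").convert("RGB")
cleaned = Image.fromarray(
    cv2.cvtColor(remove_embedded_text("meme.png"), cv2.COLOR_BGR2RGB)
)
similarity = torch.nn.functional.cosine_similarity(
    clip_image_embedding(original), clip_image_embedding(cleaned)
)
print(f"Cosine similarity (with vs. without embedded text): {similarity.item():.3f}")
```

A low similarity under this kind of probe would suggest the image encoder's representation depends heavily on the linguistic content in the pixel space, in line with the paper's finding.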