Emergence of Text Semantics in CLIP Image Encoders

Published: 10 Oct 2024, Last Modified: 04 Nov 2024
Venue: UniReps
License: CC BY 4.0
Supplementary Material: zip
Track: Extended Abstract Track
Keywords: clip, image encoders, representations, interpretability
TL;DR: We find that CLIP image representations encode textual semantics while being robust to visual attributes like font.
Abstract: Certain self-supervised approaches to training image encoders, such as CLIP, align images with their text captions. However, these approaches have no a priori incentive to associate text appearing inside an image with that text's semantics. Our work studies the semantics of text rendered in images. We show evidence suggesting that CLIP's image representations contain a subspace for textual semantics that abstracts away fonts. Furthermore, we show that rendered-text representations from the image encoder only slightly lag behind the text-encoder representations in preserving semantic relationships.
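One way to quantify how well rendered-text image embeddings preserve the semantic relationships of the corresponding text embeddings, as the abstract describes, is a representational-similarity analysis: compute pairwise cosine similarities within each embedding set and correlate the two similarity structures. Below is a minimal sketch of that metric; the random placeholder embeddings stand in for real CLIP outputs, and the function name and noise setup are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def similarity_preservation(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Spearman correlation between the pairwise cosine-similarity
    structures of two embedding sets (rows = items, columns = dims)."""
    def pairwise_cos(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        sim = x @ x.T
        # keep only the upper triangle (each unordered pair once)
        iu = np.triu_indices(len(x), k=1)
        return sim[iu]

    a, b = pairwise_cos(emb_a), pairwise_cos(emb_b)
    # Spearman correlation = Pearson correlation of the rank vectors
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return float(np.corrcoef(ra, rb)[0, 1])

# Placeholder embeddings standing in for CLIP text-encoder outputs
# and image-encoder outputs of the same text rendered into images:
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(20, 512))
image_emb = text_emb + 0.3 * rng.normal(size=(20, 512))  # noisy proxy

print(round(similarity_preservation(text_emb, image_emb), 3))
```

A score near 1 means the image encoder's rendered-text embeddings induce nearly the same similarity ordering over item pairs as the text embeddings; the abstract's claim corresponds to this score being high but slightly below that of the text encoder compared with itself.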
Submission Number: 65