Supplementary Material: zip
Track: Extended Abstract Track
Keywords: clip, image encoders, representations, interpretability
TL;DR: We find that CLIP image representations encode textual semantics while being robust to visual attributes like font.
Abstract: Certain self-supervised approaches for training image encoders, such as CLIP, align
images with their text captions. However, these approaches have no a priori
incentive to associate text rendered inside an image with that text's semantics.
Our work studies the semantics of text rendered in images. We show evidence
suggesting that CLIP's image representations contain a subspace for textual
semantics that abstracts away fonts. Furthermore, we show that the rendered-text
representations from the image encoder lag only slightly behind the text-encoder
representations in preserving semantic relationships.
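The abstract's last claim can be quantified by comparing the pairwise cosine-similarity structure of rendered-text image embeddings against that of the corresponding text embeddings. A minimal sketch of one such agreement measure is below; it uses synthetic embeddings as stand-ins for real CLIP outputs (the function names and the noise model are illustrative assumptions, not the paper's actual protocol):

```python
import numpy as np

def cosine_sim_matrix(X):
    # Row-normalize, then take all pairwise cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def similarity_agreement(img_emb, txt_emb):
    """Spearman-style rank correlation between the off-diagonal entries
    of the two modalities' cosine-similarity matrices. A value near 1
    means the image embeddings preserve the text embeddings' semantic
    relationships."""
    n = img_emb.shape[0]
    mask = ~np.eye(n, dtype=bool)
    a = cosine_sim_matrix(img_emb)[mask]
    b = cosine_sim_matrix(txt_emb)[mask]
    # Convert similarities to ranks, then compute Pearson on the ranks.
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Synthetic demo: image embeddings as noisy copies of text embeddings.
rng = np.random.default_rng(0)
txt_emb = rng.normal(size=(20, 64))
img_emb = txt_emb + 0.5 * rng.normal(size=txt_emb.shape)
print(similarity_agreement(img_emb, txt_emb))
```

In practice one would replace the synthetic arrays with CLIP image-encoder outputs on rendered-text images and CLIP text-encoder outputs on the same strings.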
Submission Number: 65