Exploring Visual Interpretability for Contrastive Language-Image Pretraining

22 Sept 2022 (modified: 08 Jun 2025) · ICLR 2023 Conference Withdrawn Submission
Keywords: Visual Interpretability, Explainability, Contrastive Language-Image Pretraining, Multimodality
TL;DR: A visual interpretability study of CLIP. We observe that CLIP yields visualizations opposite to human expectations, trace the cause to a semantic shift at the pooling layer, and then solve this problem with nontrivial improvements.
Abstract: Contrastive Language-Image Pre-training (CLIP) learns rich representations from readily available natural-language supervision. It improves the performance of downstream vision tasks, including but not limited to zero-shot classification, long-tailed recognition, segmentation, retrieval, captioning, and video understanding. However, the visual interpretability of CLIP is rarely studied, especially at the level of the raw feature map. To provide visual explanations of its predictions, we propose the Image-Text Similarity Map (ITSM). Based on it, we surprisingly find that CLIP prefers background regions over foregrounds and produces erroneous visualizations that contradict human understanding. Experimentally, we find the devil is in the pooling part, where inappropriate pooling methods lead to a phenomenon we call semantic shift. To correct and boost the visualization results, we propose Masked Max Pooling, which uses an attention map from a self-supervised image encoder. Meanwhile, interpretability and recognition require different representations; to address this, we propose dual projections to serve both requirements. We integrate the above methods as Interpretable Contrastive Language-Image Pre-training (ICLIP). Our experiments suggest that ICLIP greatly improves the interpretability of CLIP, e.g., nontrivial improvements of 32.85% and 49.10% on the VOC 2012 dataset.
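For concreteness, below is a minimal sketch of how an image-text similarity map of this kind could be computed from CLIP's pre-pooling patch features; the function name, tensor shapes, and min-max normalization are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an Image-Text Similarity Map (ITSM), assuming access to
# CLIP's patch-level features before pooling. Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def image_text_similarity_map(patch_features: torch.Tensor,
                              text_embedding: torch.Tensor,
                              grid_size: int) -> torch.Tensor:
    """Cosine-similarity heatmap between image patches and a text embedding.

    patch_features: (N, D) patch tokens from the image encoder, before pooling.
    text_embedding: (D,) embedding of the text prompt.
    grid_size: side length of the patch grid (N must equal grid_size ** 2).
    """
    patch_features = F.normalize(patch_features, dim=-1)
    text_embedding = F.normalize(text_embedding, dim=-1)
    sim = patch_features @ text_embedding            # (N,) similarity per patch
    itsm = sim.reshape(grid_size, grid_size)         # arrange as a spatial map
    # Min-max normalize to [0, 1] for visualization (assumed convention).
    itsm = (itsm - itsm.min()) / (itsm.max() - itsm.min() + 1e-6)
    return itsm

# Example with random stand-ins for real CLIP features:
# heatmap = image_text_similarity_map(torch.randn(49, 512), torch.randn(512), grid_size=7)
```

Upsampling such a heatmap to the input resolution and overlaying it on the image is one way to inspect whether the model attends to foreground or background regions, which is the kind of diagnosis the abstract describes.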
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Social Aspects of Machine Learning (e.g., AI safety, fairness, privacy, interpretability, human-AI interaction, ethics)
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/exploring-visual-interpretability-for/code)