Exploring Visual Interpretability for Contrastive Language-Image Pretraining

22 Sept 2022 (modified: 08 Jun 2025) · ICLR 2023 Conference Withdrawn Submission
Keywords: Visual Interpretability, Explainability, Contrastive Language-Image Pretraining, Multimodality
TL;DR: A visual interpretability study of CLIP. We observe that CLIP yields visualizations opposite to human expectations, trace the cause to a semantic shift at the pooling layer, and then solve this problem with nontrivial improvements.
Abstract: Contrastive Language-Image Pre-training (CLIP) learns rich representations from readily available natural-language supervision. It improves the performance of downstream vision tasks, including but not limited to zero-shot classification, long-tailed recognition, segmentation, retrieval, captioning, and video understanding. However, the visual interpretability of CLIP is rarely studied, especially at the level of the raw feature map. To provide visual explanations of its predictions, we propose the Image-Text Similarity Map (ITSM). Based on it, we surprisingly find that CLIP prefers background regions over foregrounds and produces erroneous visualizations that contradict human understanding. Experimentally, we find the devil is in the pooling part, where inappropriate pooling methods lead to a phenomenon we call semantic shift. To correct and boost the visualization results, we propose Masked Max Pooling, which uses an attention map from a self-supervised image encoder. Meanwhile, interpretability and recognition require different representations; to address this, we propose dual projections to serve both requirements. We integrate the above methods as Interpretable Contrastive Language-Image Pre-training (ICLIP). Our experiments suggest that ICLIP greatly improves the interpretability of CLIP, e.g., nontrivial improvements of 32.85% and 49.10% on the VOC 2012 dataset.
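For concreteness, below is a minimal sketch of how an image-text similarity map of this kind could be computed from CLIP's pre-pooling patch features; the function name, tensor shapes, and min-max normalization are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an Image-Text Similarity Map (ITSM), assuming access to
# CLIP's patch-level features before pooling. Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def image_text_similarity_map(patch_features: torch.Tensor,
                              text_embedding: torch.Tensor,
                              grid_size: int) -> torch.Tensor:
    """Cosine-similarity heatmap between image patches and a text embedding.

    patch_features: (N, D) patch tokens from the image encoder, before pooling.
    text_embedding: (D,) embedding of the text prompt.
    grid_size: side length of the patch grid (N must equal grid_size ** 2).
    """
    patch_features = F.normalize(patch_features, dim=-1)
    text_embedding = F.normalize(text_embedding, dim=-1)
    sim = patch_features @ text_embedding            # (N,) similarity per patch
    itsm = sim.reshape(grid_size, grid_size)         # arrange as a spatial map
    # Min-max normalize to [0, 1] for visualization (assumed convention).
    itsm = (itsm - itsm.min()) / (itsm.max() - itsm.min() + 1e-6)
    return itsm

# Example with random stand-ins for real CLIP features:
# heatmap = image_text_similarity_map(torch.randn(49, 512), torch.randn(512), grid_size=7)
```

Upsampling such a heatmap to the input resolution and overlaying it on the image is one way to inspect whether the model attends to foreground or background regions, which is the kind of diagnosis the abstract describes.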
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Social Aspects of Machine Learning (e.g., AI safety, fairness, privacy, interpretability, human-AI interaction, ethics)
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/exploring-visual-interpretability-for/code)