Advancing Interpretability of CLIP Representations with Concept Surrogate Model

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Surrogate Model, Concept-based explanations, Representation Learning, Interpretability, CLIP
Abstract: Contrastive Language-Image Pre-training (CLIP) generates versatile multimodal embeddings for diverse applications, yet the specific information captured within these representations is not fully understood. Current explainability techniques often target specific tasks, overlooking the rich, general semantics inherent in the representations. Our objective is to reveal the concepts encoded in CLIP embeddings by learning a surrogate representation, expressed as a linear combination of human-understandable concepts evident in the image. Our method, EXPLAIN-R, leverages CLIP's learned instance-instance similarity to train a surrogate model that faithfully mimics CLIP's behavior. From the trained surrogate, we derive concept scores for each input image; these scores quantify the contribution of each concept and serve as the explanation for the representation. Quantitative evaluations on multiple datasets demonstrate our method's superior faithfulness over the baseline. Moreover, a user study confirms that our explanations are perceived as more relevant, complete, and useful. Our work provides a novel approach for interpreting CLIP image representations, making them more interpretable to users and fostering more trustworthy AI systems.
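The abstract describes training a surrogate that expresses each image representation as a linear combination of human-understandable concepts, supervised by CLIP's instance-instance similarity. The sketch below is a minimal illustration of that idea, not the paper's actual formulation: the concept vocabulary, the use of precomputed (here random placeholder) CLIP embeddings and concept presence scores, the MSE loss on pairwise cosine similarities, and the contribution heuristic at the end are all assumptions.

```python
# Hypothetical sketch of a linear concept surrogate trained to mimic
# CLIP's instance-instance similarity structure (names, shapes, and the
# loss are illustrative assumptions, not EXPLAIN-R's exact method).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_images, n_concepts, emb_dim = 512, 50, 512

# Assumed inputs:
#   clip_embeds[i]     -- precomputed CLIP image embedding of image i
#   concept_scores[i]  -- presence score of each concept in image i, e.g.
#                         cosine similarity to CLIP text embeddings of
#                         concept names (random placeholders here)
clip_embeds = F.normalize(torch.randn(n_images, emb_dim), dim=-1)
concept_scores = torch.rand(n_images, n_concepts)

# Surrogate: a linear map from concept scores to an embedding, so each
# surrogate representation is a linear combination of concept directions.
surrogate = torch.nn.Linear(n_concepts, emb_dim, bias=False)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-2)

# Target: pairwise (instance-instance) cosine similarities under CLIP.
sim_clip = clip_embeds @ clip_embeds.t()

for step in range(200):
    surrogate_embeds = F.normalize(surrogate(concept_scores), dim=-1)
    sim_surr = surrogate_embeds @ surrogate_embeds.t()
    # Train the surrogate to reproduce CLIP's similarity structure.
    loss = F.mse_loss(sim_surr, sim_clip)
    opt.zero_grad()
    loss.backward()
    opt.step()

# One possible per-image "concept contribution": concept activation times
# the alignment of that concept's learned direction with the CLIP embedding.
with torch.no_grad():
    alignment = (surrogate.weight.t() @ clip_embeds.t()).t()   # (n_images, n_concepts)
    contrib = concept_scores * alignment
    top_concepts = contrib[0].topk(5).indices  # top-5 concepts for image 0
```

In this toy setup, the learned weight matrix plays the role of a dictionary of concept directions, and the per-image products give scores that could be read as a concept-level explanation of the representation; the paper's actual concept scoring may differ.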
Supplementary Material: zip
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 23041