Keywords: Interpretable AI, Explainable AI, Feature decomposition
Abstract: The CLIP model has demonstrated strong capabilities in capturing the relationship between images and text through its learned high-dimensional representations. However, these dense features primarily express similarity via cosine distance, offering limited insight into the underlying causes of that similarity. Recent efforts have explored sparse decomposition techniques to extract semantically meaningful components from CLIP features as a form of interpretation. Nevertheless, we argue that these methods treat each modality independently, yielding inconsistent decompositions that fail to explain cross-modal similarity at the level of shared concepts.
In this paper, we introduce an explanation method for CLIP similarity via Dual Modalities Decomposition, CLIP-DMD, which employs a Sparse Autoencoder (SAE) to learn sparse decompositions of both CLIP image and text features within a shared concept space. To enhance interpretability, we propose two novel objectives: a Rate Constraint ($RC$) Loss, which encourages a small set of crucial concepts to dominate the overall similarity, and a Corpus Cycle Consistency ($C^3$) Loss, which ensures that the most responsive features are both distinctive and accurately recognized by the encoder. To assess interpretability, we also design an evaluation protocol that leverages Large Language Models (LLMs) to provide automated, human-aligned assessments. Experimental results show that CLIP-DMD not only achieves competitive zero-shot classification, retrieval, and linear probing performance, but also delivers explanations of CLIP similarity that are more understandable, plausible, and preferred by humans compared to prior methods.
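The core idea, decomposing both modalities' features with a single SAE so that similarity can be attributed concept by concept, can be sketched as follows. This is a minimal illustration with random weights, not the paper's implementation; the dimensions, the tied decoder, and the ReLU encoder are all assumptions, and the $RC$ and $C^3$ losses are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_concepts = 512, 4096  # hypothetical CLIP feature dim and concept-space size

# One shared encoder/decoder pair is used for BOTH modalities, so image and
# text features are decomposed into the same concept basis.
W_enc = rng.standard_normal((d_model, d_concepts)) * 0.02
b_enc = np.zeros(d_concepts)
W_dec = W_enc.T.copy()  # tied decoder (an assumption; weights may be untied in practice)

def encode(x):
    # ReLU encoder yields a sparse, non-negative concept-activation vector.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z):
    # Reconstruct the dense CLIP feature from concept activations.
    return z @ W_dec

image_feat = rng.standard_normal(d_model)  # stand-in for a CLIP image embedding
text_feat = rng.standard_normal(d_model)   # stand-in for a CLIP text embedding

z_img, z_txt = encode(image_feat), encode(text_feat)

# Similarity in concept space splits into per-concept contributions: each term
# of the elementwise product attributes part of the similarity to one concept.
per_concept = z_img * z_txt
top_concepts = np.argsort(per_concept)[::-1][:5]  # concepts dominating the similarity
```

The point of the shared concept space is visible in the last two lines: because both modalities use the same basis, `z_img @ z_txt` decomposes exactly into the sum of `per_concept`, so the largest entries identify which concepts drive the image-text similarity.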
Primary Area: interpretability and explainable AI
Submission Number: 1316