Cross-Modality Interpretable Image Classification via Concept Decomposition Vectors of Visual Language Models
Inherently interpretable image classification is valuable for high-risk decision-making. Recent works achieve performance competitive with black-box models by combining visual language models (VLMs) with concept bottleneck models (CBMs). Their explanations are given as a weighted sum of similarities between the image representation and the embeddings of pre-defined text concepts. However, text alone is insufficient to represent visual information, and the choice of texts is subjective, potentially compromising both interpretability and performance. Therefore, this work explores cross-modality interpretation of critical concepts in image classification. Specifically, we build a CBM on a set of decomposed visual concepts learned from images rather than pre-defined text concepts, termed the decomposed concept bottleneck model (DCBM). The decomposition is implemented by projecting image representations onto concept decomposition vectors (CDVs). To explain CDVs across modalities, we propose a quintuple notion of concepts and a concept-sample distribution. Experiments show that DCBM achieves performance competitive with non-interpretable models and superior interpretability compared to other CBMs in terms of sparsity, groundability, factuality, fidelity, and meaningfulness.
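To make the pipeline described above concrete, the following is a minimal sketch, not the paper's implementation: the class and parameter names (DCBMSketch, cdvs, head) are hypothetical, the embedding and concept dimensions are placeholders, and the CDVs are randomly initialized here, whereas the paper learns them from images. It illustrates the core idea of projecting a VLM image embedding onto CDVs to obtain concept scores that feed a linear classification head.

```python
import torch
import torch.nn as nn

class DCBMSketch(nn.Module):
    """Minimal sketch of a decomposed concept bottleneck:
    image embeddings are projected onto concept decomposition
    vectors (CDVs), and the resulting concept scores feed a
    linear classifier."""

    def __init__(self, embed_dim: int, num_concepts: int, num_classes: int):
        super().__init__()
        # CDVs are learned from images in the paper; randomly
        # initialized here as a stand-in.
        self.cdvs = nn.Parameter(torch.randn(num_concepts, embed_dim))
        # Linear head mapping concept scores to class logits.
        self.head = nn.Linear(num_concepts, num_classes)

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # Scalar projection of each embedding onto each unit-norm CDV,
        # i.e. the vector-projection decomposition named in the abstract.
        cdvs = nn.functional.normalize(self.cdvs, dim=-1)
        concept_scores = image_emb @ cdvs.T   # (batch, num_concepts)
        return self.head(concept_scores)      # (batch, num_classes)

# Usage with a frozen VLM image encoder (e.g., CLIP) assumed to
# produce 512-dimensional embeddings:
model = DCBMSketch(embed_dim=512, num_concepts=64, num_classes=10)
logits = model(torch.randn(8, 512))  # 8 placeholder image embeddings
```

In this sketch, interpretability comes from the bottleneck: each class logit is a weighted sum of a small number of concept scores, each traceable to a CDV that can be explained in both visual and textual modalities.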