Abstract: Interpretable image classification is crucial for decision-making in high-stakes scenarios. Recent advances have demonstrated that interpretable models can achieve performance comparable to black-box models by integrating Visual Language Models (VLMs) with Concept Bottleneck Models (CBMs). These models explain their predictions by computing a weighted sum of similarities between the image representation and predefined text embeddings. However, selecting textual descriptors is subjective, and relying solely on textual information may not capture the complexities of visual data, limiting both interpretability and performance. To address these limitations, this work explores the cross-modal interpretation of class-related concepts in image classification. Specifically, we propose the Decomposed Concept Bottleneck Model (DCBM), which utilizes a set of decomposed visual concepts extracted directly from images instead of predefined text concepts. The decomposition of concepts is achieved through vector projection onto concept decomposition vectors (CDVs), which can be interpreted across both the textual and visual modalities. We introduce a quintuple notion of concepts and a concept-sample distribution theorem, which together enable the localization of decomposed concepts in images using the Segment Anything Model (SAM) with automatically generated prompts. Experimental results demonstrate that DCBM achieves performance competitive with non-interpretable models, improving classification accuracy by 3.42% and image-text groundability by 66.27% over other VLM-based CBMs. Furthermore, we evaluate the benefits of automatically generated SAM prompts for interpreting visual concepts, in contrast to prompts created by human operators.
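To illustrate the scoring mechanism described above, the following minimal Python sketch contrasts a standard VLM-based CBM (weighted sum of image-text similarities) with a DCBM-style score obtained by projecting the image embedding onto concept decomposition vectors. All names (cbm_logits, dcbm_logits, concept_text_embs, cdvs, class_weights) are hypothetical and assume L2-normalized embeddings; this is not the paper's actual implementation.

import numpy as np

def cbm_logits(image_emb, concept_text_embs, class_weights):
    # Standard VLM-based CBM: cosine similarities between the image embedding
    # and predefined text embeddings, combined by learned per-class weights.
    sims = concept_text_embs @ image_emb      # shape: (num_concepts,)
    return class_weights @ sims               # shape: (num_classes,)

def dcbm_logits(image_emb, cdvs, class_weights):
    # DCBM-style scoring as described in the abstract: project the image
    # embedding onto concept decomposition vectors (CDVs) extracted from
    # images, then combine the projections with per-class weights.
    projections = cdvs @ image_emb            # scalar projection onto each CDV
    return class_weights @ projections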