Concept-Guided Dictionary Learning for Interpretable Concept Extraction and Attribution in Large Vision–Language Models
Keywords: Vision–Language Models, Large Multimodal Models, Concept-Guided Feature Extraction, Feature Attribution, Interpretability, Explainable AI (XAI), Concept Attribution, Representation Learning, Cross-Modal Understanding, Model Transparency
TL;DR: We propose a method to extract and quantify monosemantic concepts in large vision–language models (LVLMs), enabling better interpretability of their internal representations.
Abstract: Autoregressive Large Vision–Language Models (LVLMs) generate text sequentially, conditioning each token on evolving multimodal states. This makes it difficult to assess whether predictions are grounded in \textbf{visual concepts} or instead reflect hallucination or bias. Existing concept-discovery approaches such as \textbf{TCAV}, \textbf{CRAFT}, and \textbf{CLIP-Dissect} are designed for encoder-only or contrastive models, while recent LVLM methods (CoX-LMM) depend on labeled concepts and simplified settings, which limits their scalability.
We propose \textbf{Concept-Guided Dictionary Learning (CGDL)}, a semi-supervised, scalable framework for multimodal concept discovery in autoregressive LVLMs. CGDL first probes the model to surface textual concepts from a dataset. For each concept, it constructs positive and negative patch sets using concept-grounded crops and randomized backgrounds. A contrastive dictionary-learning stage then disentangles concept-aligned activations from residual noise, yielding sparse, monosemantic vectors that reveal \textbf{semantically aligned visual–textual interactions} and enable faithful attribution of predictions to visual evidence.
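To make the contrastive dictionary-learning stage concrete, the sketch below shows one plausible instantiation in PyTorch. It is a minimal illustration, not the paper's actual objective: the function name, the non-negative sparse codes, the loss weights, and the specific contrastive penalty (suppressing atom co-activation across positive and negative patch sets) are all our own illustrative assumptions.

```python
# Minimal sketch of a contrastive dictionary-learning stage (illustrative,
# not the authors' implementation). Given LVLM visual activations for
# concept-grounded crops (positives) and randomized backgrounds (negatives),
# learn sparse dictionary atoms that fire on the concept but not the noise.
import torch
import torch.nn.functional as F

def learn_concept_dictionary(pos_acts, neg_acts, n_atoms=64,
                             l1_weight=0.1, contrast_weight=1.0,
                             lr=1e-2, steps=500):
    """pos_acts, neg_acts: (N, d) activation matrices.
    Returns (D, codes_pos): unit-norm atoms and sparse positive codes."""
    d = pos_acts.shape[1]
    D = torch.randn(n_atoms, d, requires_grad=True)
    A_pos = torch.zeros(pos_acts.shape[0], n_atoms, requires_grad=True)
    A_neg = torch.zeros(neg_acts.shape[0], n_atoms, requires_grad=True)
    opt = torch.optim.Adam([D, A_pos, A_neg], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        Dn = F.normalize(D, dim=1)            # keep atoms unit-norm
        recon_pos = F.relu(A_pos) @ Dn        # non-negative sparse codes
        recon_neg = F.relu(A_neg) @ Dn
        loss = (
            F.mse_loss(recon_pos, pos_acts)   # reconstruct concept patches
            + F.mse_loss(recon_neg, neg_acts) # reconstruct backgrounds
            + l1_weight * F.relu(A_pos).abs().mean()  # sparsity
            # contrastive term: atoms used on positives should stay silent
            # on negatives, disentangling the concept from residual noise
            + contrast_weight * (F.relu(A_pos).mean(0)
                                 * F.relu(A_neg).mean(0)).sum()
        )
        loss.backward()
        opt.step()
    return F.normalize(D, dim=1).detach(), F.relu(A_pos).detach()
```

Under this reading, an atom with large positive codes and near-zero negative codes is a candidate monosemantic direction, and projecting a prediction's activations onto it gives a simple attribution score.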
On \textbf{ImageNet-1k and MSCOCO}, CGDL outperforms recent interpretability methods with up to \textbf{4\% higher sparsity}, \textbf{11\% greater stability}, and \textbf{17\% lower overlap}, together with strong attribution faithfulness, while scaling efficiently to large concept vocabularies. These results advance concept-based interpretability for LVLMs and provide a practical step toward transparent multimodal reasoning.
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Submission Number: 5603