Towards Understanding Associative Knowledge in Vision-language Models via Neuron-level Attribution

04 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: mechanistic interpretation, vision-language models, knowledge neuron
Abstract: We investigate how the vision encoder of vision-language models (VLMs) such as CLIP stores associative knowledge about specific entities. We develop attribution methods to identify "knowledge neurons" within CLIP's vision encoder that enable recognition of entities such as celebrities, cartoon characters, and cultural symbols. Our analysis reveals that recognition of a specific entity is driven primarily by a small subset of neurons in the later feed-forward network (FFN) layers. We then propose techniques to dissect these knowledge neurons from both visual and linguistic perspectives, showing that they are activated exclusively by visual signals of the corresponding entity in complex images and that they encode semantically relevant concepts. Building on these findings, we demonstrate two practical applications: selectively removing sensitive knowledge and inserting new entity associations, both without degrading overall model performance. Our work contributes novel methods for neuron-level attribution, interpretable techniques for understanding knowledge, and effective approaches for targeted knowledge editing in VLMs.
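To make the idea of neuron-level attribution concrete, the sketch below scores FFN hidden units in CLIP's vision encoder by gradient-times-activation with respect to image-text similarity for a single entity probe. This is an illustrative sketch, not the paper's exact procedure: the checkpoint, the choice of hooking the fc1 outputs, the prompt and image path, and the aggregation over image tokens are all assumptions.

```python
# Illustrative sketch of neuron-level attribution over CLIP's vision-encoder FFN
# units, using gradient-times-activation w.r.t. image-text similarity.
# The checkpoint, the hook placement (fc1 outputs), the probe prompt, and the
# token aggregation are assumptions, not the authors' exact method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

ffn_acts = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        output.retain_grad()            # keep gradients on the FFN hidden units
        ffn_acts[layer_idx] = output    # shape: (batch, num_tokens, d_ffn)
    return hook

# Hook the first FFN projection (fc1) of every vision-transformer block.
handles = [
    layer.mlp.fc1.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.vision_model.encoder.layers)
]

# One (image, entity-name) probe; path and prompt are placeholders.
image = Image.open("entity.jpg")
img_inputs = processor(images=image, return_tensors="pt")
txt_inputs = processor(text=["a photo of Albert Einstein"],
                       return_tensors="pt", padding=True)

img_emb = model.get_image_features(**img_inputs)
txt_emb = model.get_text_features(**txt_inputs)
sim = torch.cosine_similarity(img_emb, txt_emb).sum()
sim.backward()                          # populates .grad on the hooked activations

# Attribution score per neuron: activation * gradient, summed over image tokens.
scores = {
    layer_idx: (act * act.grad).sum(dim=(0, 1))     # (d_ffn,)
    for layer_idx, act in ffn_acts.items()
}

# Candidate "knowledge neurons": top-k units per layer by attribution score.
top_neurons = {i: torch.topk(s, k=10).indices for i, s in scores.items()}

for h in handles:
    h.remove()
```

In a fuller pipeline, one would presumably aggregate such scores over many images of the same entity and contrast them with scores on unrelated images, so that entity-specific neurons can be separated from neurons that are broadly active.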
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 1888