Keywords: mechanistic interpretation, vision-language models, knowledge neuron
Abstract: We investigate how the vision encoder of vision-language models (VLMs), such as CLIP, stores associative knowledge about specific entities. We develop attribution methods to identify "knowledge neurons" within CLIP's visual encoder that enable recognition of entities such as celebrities, cartoon characters, and cultural symbols. Our analysis reveals that recognition of specific entities is driven primarily by a small subset of neurons in the later feed-forward network (FFN) layers. We then introduce techniques to dissect these knowledge neurons from both visual and linguistic perspectives, showing that they are activated exclusively by visual signals of specific entities in complex images and that they encode semantically relevant concepts. Building on these findings, we propose two practical applications: selectively removing sensitive knowledge and inserting new entity associations without degrading overall model performance. Our work contributes novel methods for neuron-level attribution, interpretable techniques for knowledge understanding, and effective approaches for targeted knowledge editing in VLMs.
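To make the neuron-attribution idea concrete, below is a minimal sketch of how FFN neurons in a late layer of CLIP's vision encoder could be scored for a target entity. It is not the authors' exact method: the checkpoint (`openai/clip-vit-base-patch32`), the choice of layer, the gradient-times-activation scoring rule, and the file/prompt placeholders are all illustrative assumptions.

```python
# Sketch (assumed setup, not the paper's exact attribution method):
# score intermediate FFN neurons in a late CLIP vision layer by
# gradient x activation of the image-text similarity for an entity prompt.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

layer_idx = -2          # a "later" FFN layer, per the abstract (illustrative choice)
captured = {}

def hook(module, inputs, output):
    # Keep the post-activation FFN intermediate and its gradient for attribution.
    output.retain_grad()
    captured["act"] = output

mlp = model.vision_model.encoder.layers[layer_idx].mlp
handle = mlp.activation_fn.register_forward_hook(hook)

image = Image.open("entity.jpg")                      # hypothetical image of the entity
inputs = processor(text=["a photo of <entity name>"], images=image,
                   return_tensors="pt", padding=True)

outputs = model(**inputs)
similarity = outputs.logits_per_image[0, 0]           # image-text match score
similarity.backward()

act = captured["act"]                                 # shape: (1, tokens, ffn_dim)
scores = (act * act.grad).sum(dim=(0, 1))             # one attribution score per neuron
top_neurons = torch.topk(scores, k=10).indices
print(f"layer {layer_idx} candidate knowledge neurons: {top_neurons.tolist()}")
handle.remove()
```

Neurons with consistently high scores across images of the same entity would be candidate "knowledge neurons"; a more faithful implementation might use integrated gradients over the activations rather than this single-step proxy.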
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 1888