Keywords: Vision transformers, Steering
TL;DR: The paper shows that specific attention heads govern how vision-language models resolve conflicts between internal knowledge and visual inputs, enabling controllable steering and more precise attribution than gradient-based methods.
Abstract: Vision-language models (VLMs) increasingly integrate visual and textual information to perform complex tasks. However, conflicts between their internal knowledge and external visual input can lead to hallucinations and unreliable predictions. In this work, we investigate the mechanisms that VLMs use to resolve cross-modal conflicts by introducing a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge. Through logit inspection, we identify a small set of attention heads that mediate this conflict. By intervening on these heads, we can steer the model toward either its internal knowledge or the visual input. Our results show that the attention of these heads effectively localizes the image regions that drive visual overrides, providing more precise attribution than gradient-based methods.
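The head-level intervention described in the abstract can be pictured as a small steering hook: scaling the output of a chosen set of attention heads up or down to push the model toward the visual input or toward its parametric knowledge. The sketch below is illustrative only, not the authors' implementation; the module path `model.model.layers[i].self_attn`, the output layout `(batch, seq_len, num_heads * head_dim)`, and the specific layer/head indices are assumptions.

```python
# Minimal sketch of attention-head steering via PyTorch forward hooks.
# Assumptions (not from the paper): the attention module's first output has
# shape (batch, seq_len, num_heads * head_dim), and attention modules live at
# model.model.layers[layer_idx].self_attn.
import torch


def make_head_scaling_hook(head_indices, head_dim, alpha):
    """Scale the contribution of selected heads in an attention module's output.

    alpha > 1 amplifies the selected heads (pushing toward the behavior they
    mediate, e.g. the visual override); 0 <= alpha < 1 suppresses them
    (pushing toward the model's internal knowledge).
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        b, t, d = hidden.shape
        # Split the flat hidden dimension into per-head slices, scale, re-merge.
        per_head = hidden.view(b, t, d // head_dim, head_dim).clone()
        per_head[:, :, head_indices, :] *= alpha
        steered = per_head.view(b, t, d)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook


def steer(model, heads_by_layer, head_dim, alpha):
    """Register steering hooks on the (assumed) attention modules of given layers.

    heads_by_layer: dict mapping layer index -> list of head indices to scale.
    Returns the hook handles; call handle.remove() on each to restore the model.
    """
    handles = []
    for layer_idx, head_indices in heads_by_layer.items():
        attn = model.model.layers[layer_idx].self_attn  # assumed module path
        handles.append(attn.register_forward_hook(
            make_head_scaling_hook(head_indices, head_dim, alpha)))
    return handles
```

As a usage note, one would register the hooks with a suppression factor (e.g. `alpha=0`) before generation to probe reliance on internal knowledge, then remove them to recover the unmodified model; the actual heads to target would come from the paper's logit-inspection step.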
Submission Number: 17