Keywords: Mechanistic Interpretability
TL;DR: The paper investigates why Vision-Language Models (VLMs) struggle with factual recall, identifying a "two-hop" problem where visual representations emerge too late to engage early-layer recall mechanisms.
Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities but struggle with factual recall tasks compared to their underlying language model (LM). While previous work attributes this to insufficient computational depth after visual processing, we provide an alternative explanation: the distributed representation of visual information across visual tokens in early layers bypasses the factual recall mechanism that resides in the early-layer MLPs of the LM backbone. The performance gap therefore stems from the architectural design of VLMs rather than from insufficient computational capacity. Using linear probes, we show that dedicated linear representations of visual information emerge only in the middle-to-late layers of VLMs. As a result, factual recall in VLMs becomes a “two-hop” challenge: the recall mechanism resides in layers that precede visual processing, yet visual processing finishes too late in the model to engage it. Through comparative analysis, we demonstrate that successful factual recall depends on how quickly the first processing “hop” completes. To further support our hypothesis, we patch early-layer MLP outputs from the LM backbone into the corresponding VLM layers, significantly improving factual recall performance. This suggests that the absence of properly aligned token embeddings in early layers is a key factor in factual recall degradation. Finally, we introduce a benchmark to systematically evaluate factual recall accuracy and knowledge hallucination in multimodal settings. Our findings highlight a fundamental architectural limitation in current VLMs and pave the way for designing models that better integrate visual and linguistic information for reliable factual reasoning.
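To make the patching experiment concrete, the sketch below shows one plausible way such an intervention could be implemented with PyTorch forward hooks: cache early-layer MLP outputs from a text-only run of the LM backbone, then overwrite the corresponding MLP outputs in the VLM's language tower. Module paths such as `lm.model.layers[i].mlp` and `vlm.language_model`, as well as the token-position alignment between the two runs, are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of early-layer MLP patching, assuming a PyTorch VLM whose
# language tower mirrors the standalone LM layer-for-layer. Attribute paths
# and the token alignment between the two runs are hypothetical.
import torch

def run_with_mlp_cache(lm, inputs, layers):
    """Run the text-only LM and cache its MLP outputs at the given layers."""
    cache, handles = {}, []
    for i in layers:
        def save(module, args, output, idx=i):
            cache[idx] = output.detach()
        handles.append(lm.model.layers[i].mlp.register_forward_hook(save))
    with torch.no_grad():
        lm(**inputs)
    for h in handles:
        h.remove()
    return cache

def run_with_mlp_patch(vlm, inputs, cache, layers, lm_slice, vlm_slice):
    """Run the VLM, replacing its early-layer MLP outputs at the prompt-token
    positions (vlm_slice) with the cached LM activations at the corresponding
    positions of the text-only run (lm_slice)."""
    handles = []
    for i in layers:
        def patch(module, args, output, idx=i):
            patched = output.clone()
            patched[:, vlm_slice, :] = cache[idx][:, lm_slice, :]
            return patched  # returning a tensor replaces the module's output
        handles.append(
            vlm.language_model.model.layers[i].mlp.register_forward_hook(patch)
        )
    with torch.no_grad():
        out = vlm(**inputs)
    for h in handles:
        h.remove()
    return out
```

Because a forward hook that returns a tensor replaces the module's output, the patched early-layer activations propagate to all subsequent layers, which is what lets the LM's "first hop" stand in for the VLM's late-arriving visual representation.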
Public: Yes
Submission Number: 19