Process-then-Retrieve: A Mechanistic Study of Cross-Modal Alignment in Vision-Language Models

Published: 02 Mar 2026 · Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: multimodal, mechanistic interpretability, adapter-based VLMs
TL;DR: Adapter-based vision-language models like PaliGemma and Qwen2-VL follow a "process-then-retrieve" workflow, where early layers prioritize textual context and defer visual integration until the final layers.
Abstract: Understanding how vision–language models (VLMs) internally integrate visual and textual information remains a significant challenge. We present a mechanistic study of adapter-based VLMs, using PaliGemma-3B and Qwen2-VL as representative models, to test the hypothesis that these models follow a two-phase workflow: early layers prioritize textual processing, while later layers execute cross-modal retrieval. Using representational similarity analysis, attention patching, and residual stream attribution, we show that early layers preserve visual embeddings with minimal modification while focusing on text. Significant cross-modal alignment and visual attention appear only in the final layers. We find that this structural bias is a primary contributor to textual dominance, where linguistic priors can override conflicting visual evidence. Our results provide a foundation for addressing the "modality gap" and offer insights into multimodal reasoning in VLM architectures.
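As a rough illustration of the layer-wise representational similarity analysis the abstract describes, here is a minimal sketch. It assumes per-layer hidden states have already been extracted (e.g., with `output_hidden_states=True` in a Hugging Face forward pass) and uses linear CKA as the similarity metric; the metric choice, the function names `linear_cka` and `visual_drift`, and the index tensor `vis_idx` are illustrative assumptions, not necessarily the authors' exact setup.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two (num_tokens x hidden_dim) representation matrices."""
    # Center each feature dimension over tokens.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = ((y.T @ x) ** 2).sum()
    den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (num / den).item()

def visual_drift(hidden_states, vis_idx):
    """Similarity of visual-token representations at each layer to the
    projected visual embeddings as they enter the transformer (layer 0).

    hidden_states: sequence of (seq_len, hidden_dim) tensors, one per layer,
    for a single example; vis_idx indexes the image-token positions.
    """
    base = hidden_states[0][vis_idx]
    return [linear_cka(base, h[vis_idx]) for h in hidden_states]
```

Under the paper's hypothesis, one would expect high CKA against the layer-0 visual embeddings through most of the stack, dropping only in the final layers where cross-modal retrieval begins, consistent with the reported "process-then-retrieve" pattern.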
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 86