Abstract: ColPali recently proposed a method for explaining multimodal retrieval-augmented generation (RAG) by visualizing how vision–language models (VLMs) connect image patches to text tokens. However, our theoretical analysis and experiments show that these similarity-based saliency maps are fragile and often misleading. We therefore caution against relying solely on intuitive visualizations and instead present a principled patch-level dissection technique that traces how VLMs actually accumulate evidence across modalities. To quantify this fragility, we introduce Needle-in-a-Patched-Haystack: a patch-centered dataset and metric suite that measures transparency by benchmarking patch-localization performance in VLMs. Together, our analysis and toolkit establish a stricter standard for VLM interpretability and provide a drop-in evaluation protocol for future research on robust multimodal explanations.
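For context, the sketch below illustrates the kind of similarity-based saliency map the abstract critiques: each image patch is scored by its embedding similarity to the query's text tokens, ColPali-style late interaction (MaxSim), and the scores are rendered as a heat map over the patch grid. The function name `patch_saliency`, the embedding shapes, and the normalization are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def patch_saliency(text_emb: np.ndarray, patch_emb: np.ndarray) -> np.ndarray:
    """text_emb: (T, d) query-token embeddings; patch_emb: (P, d) patch embeddings.
    Returns a length-P saliency vector scaled to [0, 1] for display."""
    # L2-normalize so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    sim = t @ p.T                    # (T, P) token-to-patch similarities
    sal = sim.max(axis=0)            # per patch: best-matching token (MaxSim)
    # Min-max normalize so the map can be overlaid as a heat map.
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)

# Stand-in random embeddings: 16 query tokens, a 32x32 patch grid, d = 128.
rng = np.random.default_rng(0)
heat = patch_saliency(rng.normal(size=(16, 128)), rng.normal(size=(1024, 128)))
grid = heat.reshape(32, 32)          # reshape to the patch grid for plotting
```

Because the map is a pure function of pairwise embedding similarities, small perturbations to either embedding space can reshuffle the top-scoring patches, which is the fragility the abstract's analysis targets.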