See, Think, Hallucinate: Interpreting Reasoning and Hallucinations Beyond the First Hop in Vision-Language Models

ICLR 2026 Conference Submission 22787 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-Language Models, Multimodal reasoning, Multi-hop question answering, Hallucination, Interpretability
Abstract: Vision-language models (VLMs) are prone to hallucinations, including factual inaccuracies, biases, and reasoning failures. Prior research has primarily focused on object hallucinations in single-hop settings, where models are asked to describe an image and are evaluated on whether they mention non-existent objects. However, such work overlooks broader forms of hallucination that arise in more complex reasoning scenarios. In this paper, we investigate hallucinations in vision-language reasoning beyond the first hop, where models must first extract factual content from an image and then combine it with external knowledge to answer a question. In particular, we first present MMHop, a dataset of multimodal two-hop questions spanning five knowledge categories: general reasoning, perceptual co-occurrence, temporal knowledge, cultural and regional knowledge, and biased prior knowledge. Using MMHop, we conduct a systematic analysis of VLMs with different architectures and LLM backbones, uncovering where hallucinations arise and how reasoning unfolds. Our comparative study reveals distinct failure tendencies: some models are easily distracted by visual co-occurrence, while others rely excessively on internal knowledge or stereotypical priors. Beyond model-specific behaviors, our results highlight common structural patterns in two-hop reasoning. VLMs exhibit a two-stage inference process: an input understanding stage, dominated by multi-head attention, followed by a reasoning stage, where feed-forward networks become increasingly important. Early reasoning layers primarily capture first-hop inference, while later layers focus on second-hop reasoning. We further identify failure modes across categories: shortcut reliance on visual context, shallow recall of temporal knowledge, weak cultural grounding, and bias-driven errors. Finally, we show that question variants and inference settings, such as test-time scaling, can alter reasoning dynamics and reduce hallucination. Our analyses provide new interpretability-driven insights into multimodal hallucinations, paving the way toward more reliable and trustworthy vision-language reasoning systems.
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 22787