RIV-CoT: Retrieval-Based Interleaved Visual Chain-of-Thought for Multimodal Reasoning

Published: 23 Sept 2025, Last Modified: 19 Nov 2025, SpaVLE Poster, CC BY 4.0
Keywords: vision-language models, chain-of-thought, visual reasoning, spatial reasoning
TL;DR: We create a visual question answering dataset annotated with grounded reasoning traces, along with a method to perform visual reasoning by retrieving image patches.
Abstract: While chain-of-thought (CoT) prompting improves reasoning in large language models, its effectiveness in vision-language models (VLMs) remains limited due to over-reliance on textual cues and memorized knowledge. To investigate the visual reasoning capabilities of VLMs in complex real-world scenarios, we introduce DrivingVQA, a visual question answering dataset derived from driving theory exams, which contains 3,931 multiple-choice problems with expert-written explanations and grounded entities relevant to the reasoning process. Leveraging this dataset, we explore the benefits of incorporating entity-related information, such as entity names, spatial coordinates, and visual content, through supervised fine-tuning to enhance the model's reasoning abilities. Our experiments demonstrate that our proposed method, RIV-CoT -- interleaving textual explanations with visual tokens retrieved from the input image -- improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting. Furthermore, we demonstrate that this retrieval-based approach effectively scales to the larger A-OKVQA reasoning dataset by leveraging automatically generated pseudo-labels, outperforming CoT prompting.
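The core mechanism the abstract describes (retrieving image regions for entities that appear in the reasoning trace and interleaving their visual tokens with the textual explanation) can be illustrated with a minimal sketch. This is an assumption-based illustration of the general idea, not the paper's implementation: the `Entity` format, `encode_patch` stand-in, and name-matching heuristic are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

from PIL import Image


@dataclass
class Entity:
    """An entity grounded in the image: a name plus a pixel bounding box."""
    name: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom)


def encode_patch(patch: Image.Image) -> List[str]:
    """Hypothetical stand-in for a vision encoder that maps an image crop
    to a short sequence of visual tokens."""
    return [f"<vis:{patch.width}x{patch.height}>"]


def interleave_cot(image: Image.Image,
                   explanation_steps: List[str],
                   entities: List[Entity]) -> List[Union[str, List[str]]]:
    """Build an interleaved reasoning sequence: after each explanation step
    that mentions a grounded entity, retrieve (crop) that entity's region
    from the input image and append its visual tokens."""
    sequence: List[Union[str, List[str]]] = []
    for step in explanation_steps:
        sequence.append(step)
        for ent in entities:
            if ent.name.lower() in step.lower():
                patch = image.crop(ent.bbox)          # retrieve the region
                sequence.append(encode_patch(patch))  # interleave its tokens
    return sequence


if __name__ == "__main__":
    img = Image.new("RGB", (640, 480))  # placeholder for a driving scene
    steps = [
        "The speed limit sign on the right restricts the manoeuvre.",
        "Therefore, answer B is correct.",
    ]
    ents = [Entity("speed limit sign", (400, 120, 470, 200))]
    print(interleave_cot(img, steps, ents))
```

In an actual VLM training pipeline, the interleaved sequence would be tokenized so that the retrieved patches become visual tokens embedded in the supervision target, rather than the string placeholders used here.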
Submission Type: Long Research Paper (< 9 Pages)
Submission Number: 34