Retrieval-Augmented Reasoning for Visual Localization

Published: 15 Nov 2025, Last Modified: 08 Mar 2026, Venue: AAAI 2026 Bridge LMReasoning, License: CC BY 4.0
Keywords: Retrieval-Augmented Generation, Multimodal Chain-of-Thought, Vision-Language Models, Medical Image
TL;DR: The RAR-VL framework makes VLMs reliable for medical localization. It uses RAG to retrieve verifiable evidence and MCoT for traceable reasoning, achieving SOTA zero-shot performance while boosting trustworthiness.
Abstract: Open-vocabulary medical image localization holds significant potential for clinical applications. However, the practical reliability of current Vision-Language Models (VLMs) is constrained by critical limitations. They generate spatial prompts from statistical patterns rather than explicit medical evidence, resulting in unreliable localization. Furthermore, this implicit reasoning process is untraceable, failing to meet the clinical demand for evidence-based decision-making. To address these challenges, we propose RAR-VL (Retrieval-Augmented Reasoning for Visual Localization), a framework that transforms VLMs from implicit guessers into explicit reasoners. RAR-VL achieves this by integrating two key components: Retrieval-Augmented Generation (RAG) to source verifiable evidence from a medical knowledge base, and a Multimodal Chain-of-Thought (MCoT) to construct a structured, traceable reasoning path from evidence to localization. Experiments validate RAR-VL's state-of-the-art performance in zero-shot localization tasks, where it significantly outperforms existing open-vocabulary baselines. These results confirm that our retrieval-augmented reasoning framework effectively enhances both localization reliability and clinical trustworthiness.
Submission Number: 35