Mitigating Visual Hallucinations via Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
Keywords: Visual Retrieval-Augmented Generation
Abstract: Visual Retrieval-Augmented Generation (VRAG) enhances Vision–Language Models (VLMs) by incorporating retrieved images as contextual evidence to support reasoning. However, existing VRAG systems often struggle to reliably perceive and integrate evidence across multiple images, leading to erroneous reasoning and visual hallucinations. In this paper, we propose EVisRAG, an end-to-end framework for evidence-guided multi-image reasoning that mitigates these issues by explicitly observing images, recording per-image evidence, and reasoning over aggregated evidence to derive the final answer. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which assigns fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning. Experiments on multiple visual question answering benchmarks show that EVisRAG consistently outperforms backbone VLMs, achieving an average improvement of 27%, while substantially reducing visual hallucinations through accurate evidence localization and grounded reasoning.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Generation
Languages Studied: English
Submission Number: 502
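As a rough illustration of the scoped-reward idea behind RS-GRPO described in the abstract, the following is a minimal sketch. It assumes that each sampled response in a group carries one reward per scope (e.g., a perception reward for the per-image evidence span and an answer reward for the final reasoning span) and that each scope is marked by a token mask; the reward definitions, scope segmentation, and function names here are assumptions for illustration, not the paper's released implementation.

import torch

def rs_grpo_token_advantages(scope_rewards, scope_masks, eps=1e-6):
    """Build per-token advantages from scope-level rewards.

    scope_rewards: dict scope_name -> (G,) tensor, one reward per sample in the group.
    scope_masks:   dict scope_name -> (G, T) 0/1 tensor marking that scope's tokens.
    Returns a (G, T) tensor of token-level advantages.
    """
    any_mask = next(iter(scope_masks.values()))
    adv = torch.zeros_like(any_mask, dtype=torch.float32)
    for name, r in scope_rewards.items():
        # Group-relative normalization within each scope, as in GRPO.
        a = (r - r.mean()) / (r.std(unbiased=False) + eps)
        # Broadcast the scope's advantage onto that scope's tokens only.
        adv += a.unsqueeze(-1) * scope_masks[name].float()
    return adv

# Toy usage: a group of G=2 samples, 6 tokens each; the first 3 tokens form
# the per-image evidence span and the last 3 the answer span (hypothetical split).
masks = {
    "evidence": torch.tensor([[1, 1, 1, 0, 0, 0]] * 2),
    "answer":   torch.tensor([[0, 0, 0, 1, 1, 1]] * 2),
}
rewards = {
    "evidence": torch.tensor([1.0, 0.0]),  # e.g., evidence-localization reward
    "answer":   torch.tensor([0.0, 1.0]),  # e.g., answer-correctness reward
}
print(rs_grpo_token_advantages(rewards, masks))

Under this reading, perception tokens are updated only by the perception reward and reasoning tokens only by the answer reward, so the two capabilities can be optimized jointly without reward signals leaking between scopes.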