Keywords: Multimodal RAG, Multi-modal Retrieval, Visual Question Answering
Abstract: Vision Large Language Models (VLLMs) have achieved remarkable success in visual question answering, but suffer from critical hallucination problems, generating confident-sounding but factually incorrect responses. While Retrieval-Augmented Generation (RAG) offers promising solutions, existing multi-modal RAG approaches face three key limitations: retrieval strategies that ignore model confidence, lack of effective hallucination detection, and models not trained to express uncertainty. We propose ConfRAG, a confidence-calibrated retrieval-augmented generation framework that systematically addresses these limitations through three core innovations. First, our confidence-aware retrieval mechanism employs several confidence thresholds to filter high-quality evidence during both image-based and web-based retrieval. Second, our hybrid hallucination detection module uses practical rules—generation termination analysis and average token probability assessment—to identify unreliable content. Third, our IDK-aware training strategy independently optimizes three specialized pipelines (direct QA, image RAG, web RAG) using quality-based sampling to teach appropriate uncertainty expression. Comprehensive experiments on the Meta CRAG-MM challenge demonstrate ConfRAG's effectiveness, achieving 7th place overall with consistent performance across all three tasks. Notably, our IDK training transforms severely negative baselines into positive performance, demonstrating dramatic hallucination reduction while maintaining competitive accuracy. Our code and data are available at https://github.com/BUAADreamer/ConfRAG.
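The hybrid hallucination detection rules described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: the function names (`avg_token_probability`, `is_unreliable`), the threshold value, and the exact termination signal are hypothetical, chosen only to show how average token probability and generation-termination analysis could combine into a single reliability check.

```python
import math

def avg_token_probability(token_logprobs):
    """Geometric-mean token probability of a generated answer,
    computed from per-token log-probabilities."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def is_unreliable(token_logprobs, finished, threshold=0.5):
    """Flag a response as unreliable if generation did not terminate
    normally (e.g. it hit the length limit) or if its average token
    probability falls below a confidence threshold.
    NOTE: threshold=0.5 is an illustrative value, not from the paper."""
    return (not finished) or avg_token_probability(token_logprobs) < threshold
```

Under this sketch, a low-confidence or truncated generation would be filtered out (or replaced with an "I don't know" response) rather than returned to the user.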
Submission Number: 3