RoRA-VLM: Robust Retrieval Augmentation for Vision Language Models

ACL ARR 2025 May Submission 6026 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Recent vision-language models (VLMs), despite their broad capabilities, continue to underperform on knowledge-intensive tasks. Retrieval augmentation offers a promising solution by incorporating external multimodal knowledge. However, the retrieved content often contains a mix of relevant and irrelevant information, and existing methods primarily focus on improving retrieval quality to mitigate this issue. In this work, we propose RoRA-VLM, a robust retrieval augmentation framework designed to address the complementary challenge of utilizing noisy retrieved knowledge effectively. The core insight behind RoRA-VLM is that the multimodal nature of VLMs enables a novel solution: visual information can act as a signal for assessing the relevance of retrieved results. To this end, RoRA-VLM introduces a learned cross-modal verification mechanism that enables VLMs to compare visual similarities between the query and retrieved images, and to attend selectively to visually relevant retrievals while filtering out irrelevant content. Extensive experiments on the OVEN, InfoSeek, and Enc-VQA benchmarks demonstrate that RoRA-VLM achieves accuracy improvements of up to 14.76% over baseline models while requiring minimal training data, and it consistently outperforms state-of-the-art retrieval-augmented VLMs while exhibiting strong generalization to unseen domains.
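The abstract does not give implementation details, but the general idea of using query-retrieval visual similarity as a relevance signal can be illustrated with a minimal sketch. The snippet below is not the paper's learned verification mechanism; it only shows the basic pattern of scoring retrieved images against the query image with an off-the-shelf encoder (CLIP here is an assumption) and keeping the most visually similar entries. The entry schema, threshold, and helper name are all hypothetical.

```python
# Illustrative sketch only: filter retrieved entries by visual similarity
# to the query image. CLIP, the threshold, and the entry schema are
# assumptions, not the learned cross-modal verification mechanism
# described in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_by_visual_similarity(query_image: Image.Image,
                                retrieved: list[dict],
                                threshold: float = 0.6) -> list[dict]:
    """Keep retrieved entries whose image is visually similar to the query.

    Each entry in `retrieved` is assumed to hold an "image" (PIL.Image)
    alongside its retrieved text; this schema is hypothetical.
    """
    images = [query_image] + [r["image"] for r in retrieved]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    sims = feats[1:] @ feats[0]                         # cosine similarity to the query image
    return [r for r, s in zip(retrieved, sims.tolist()) if s >= threshold]
```

Per the abstract, RoRA-VLM realizes this relevance judgment as a learned mechanism inside the VLM that attends selectively to visually relevant retrievals, rather than as a fixed pre-filtering step like the hard threshold shown above.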
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: retrieval-augmented generation, vision language model
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6026