RG-VQA: Leveraging Retriever-Generator Pipelines for Knowledge Intensive Visual Question Answering

ACL ARR 2025 May Submission5512 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: In this paper, we propose a method to improve the reasoning capabilities of Visual Question Answering (VQA) systems by integrating Dense Passage Retrievers (DPRs) with Vision Language Models (VLMs). While recent works focus on knowledge graphs and chain-of-thought reasoning, the complexity of graph neural networks and of end-to-end training remains a significant challenge. To address these issues, we introduce Relevance Guided VQA (RG-VQA), a retriever-generator pipeline that uses DPRs to efficiently extract relevant information from structured knowledge bases. Our approach scales to large graphs without significant computational overhead. Experiments on the ScienceQA dataset show that RG-VQA achieves state-of-the-art performance, surpassing human accuracy and outperforming GPT-4 by more than 8%. This demonstrates the effectiveness of RG-VQA in boosting the reasoning capabilities of VQA systems and its potential for practical applications.

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: VQA, Multimodality

Contribution Types: NLP engineering experiment

Languages Studied: English

Submission Number: 5512
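
Illustrative sketch (not from the submission): the abstract describes a retriever-generator pipeline in which a dense retriever selects relevant entries from a structured knowledge base before a VLM generates the answer. The Python below shows only the general retrieve-then-prompt pattern under loud assumptions: the knowledge base `KB`, the bag-of-words `embed` function (a stand-in for trained DPR question and passage encoders), and the helpers `retrieve` and `build_prompt` are all hypothetical, and the image input that a VLM would also receive is omitted.

```python
import math
from collections import Counter

# Hypothetical toy knowledge base of (subject, relation, object) triples.
KB = [
    ("magnet", "attracts", "iron"),
    ("plant", "performs", "photosynthesis"),
    ("ice", "melts into", "water"),
]

def embed(text):
    """Stand-in encoder: L2-normalised bag-of-words counts.
    A real DPR would use trained neural encoders instead (assumption)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}
    return {word: v / norm for word, v in counts.items()}

def score(q_vec, p_vec):
    """Inner product between two sparse vectors stored as dicts."""
    return sum(w * p_vec.get(word, 0.0) for word, w in q_vec.items())

def retrieve(question, kb, k=1):
    """Return the k triples whose embeddings score highest against the question."""
    q_vec = embed(question)
    ranked = sorted(kb, key=lambda t: score(q_vec, embed(" ".join(t))), reverse=True)
    return ranked[:k]

def build_prompt(question, facts):
    """Assemble the text a generator would receive alongside the image."""
    context = "\n".join(" ".join(fact) for fact in facts)
    return f"Context:\n{context}\nQuestion: {question}\nAnswer:"

facts = retrieve("what does a magnet attract", KB, k=1)
prompt = build_prompt("what does a magnet attract", facts)
```

In this sketch the knowledge base is re-scored for every question; with learned encoders, the passage embeddings would typically be computed once and indexed so that retrieval cost stays low as the knowledge base grows.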
