Visual Question Answering with Fine-grained Knowledge Unit RAG and Multimodal LLMs

Zhengxuan Zhang; Yin WU; Yuyu Luo; Nan Tang

26 Sept 2024 (modified: 13 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: Visual Question Answering, Retrieval-Augmented Generation

Abstract: Visual Question Answering (VQA) aims to answer natural language questions based on information present in images. Recent multimodal large language models (MLLMs) with internalized world knowledge, such as GPT-4o, have demonstrated strong capabilities on VQA tasks. However, in many real-world cases MLLMs alone are not enough, because they may lack domain-specific or up-to-date knowledge relevant to the image and question. Retrieval-augmented generation (RAG) from external knowledge bases (KBs), a setting known as KB-VQA, is a promising way to mitigate this problem, but retrieving the relevant knowledge effectively is not easy. Conventional approaches typically convert images into text and employ unimodal (i.e., text-based) retrieval, which can lose visual information and hinder accurate image-to-image matching. In this paper, we introduce fine-grained knowledge units, comprising both text fragments and entity images, which are extracted from KBs and stored in vector databases. In practice, retrieving fine-grained knowledge units finds relevant information more effectively than retrieving coarse-grained knowledge. We also design a knowledge unit retrieval-augmented generation (KU-RAG) method that combines fine-grained retrieval with MLLMs. KU-RAG can accurately find the corresponding knowledge and integrate it with the MLLM's internalized knowledge through a knowledge correction chain for reasoning. Experimental results indicate that our method significantly enhances the performance of state-of-the-art KB-VQA solutions, with improvements of up to 10%.
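
The retrieval idea in the abstract (knowledge units that are either text fragments or entity images, stored as vectors and fetched by similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the `KnowledgeUnit` class, the `retrieve` function, the hand-written three-dimensional embeddings, and the toy knowledge base are all assumptions made for illustration. A real system would presumably obtain embeddings from a multimodal encoder and use an actual vector database.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class KnowledgeUnit:
    """Hypothetical fine-grained knowledge unit (illustrative only)."""
    unit_id: str
    kind: str         # "text" for a text fragment, "image" for an entity image
    payload: str      # the fragment itself, or a path to the entity image
    embedding: tuple  # placeholder vector; assumed to come from some encoder


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, units, k=2):
    """Return the k units most similar to the query, best match first."""
    ranked = sorted(
        units,
        key=lambda u: cosine(query_embedding, u.embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy knowledge base with made-up embeddings (assumption, not real data).
UNITS = [
    KnowledgeUnit("t1", "text", "The Eiffel Tower was completed in 1889.", (0.9, 0.1, 0.0)),
    KnowledgeUnit("i1", "image", "eiffel_tower.jpg", (0.8, 0.2, 0.1)),
    KnowledgeUnit("t2", "text", "The Golden Gate Bridge opened in 1937.", (0.0, 0.1, 0.9)),
]

# A query embedding that, by construction, lies close to the Eiffel Tower units.
top = retrieve((1.0, 0.0, 0.0), UNITS, k=2)
for unit in top:
    print(unit.unit_id, unit.kind, unit.payload)
```

Because text fragments and entity images share one embedding space in this sketch, a single query can return both kinds of unit, which is the property the abstract contrasts with text-only retrieval. How the retrieved units are then combined with the MLLM's own knowledge (the knowledge correction chain) is not specified in the abstract and is not modeled here.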

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 5441
