Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Yuyang Hong; Jiaqi Gu; Qi Yang; Lubin Fan; Yue Wu; Ying Wang; Kun Ding; Shiming Xiang; Jieping Ye

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, Jieping Ye

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 posterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Knowledge-based Visual Question Answering; Retrieval-Augmented Generation

TL;DR: Give VLM the ability to call tools to retrieve and filter information

Abstract: The task of Knowlegde-Based Visual Question Answering (KB-VQA) requires the model to understand visual features and retrieve external knowledge. Retrieval-Augmented Generation (RAG) have been employed to address this problem through knowledge base querying. However, existing work demonstrate two limitations: insufficient interactivity during knowledge retrieval and ineffective organization of retrieved information for Visual-Language Model (VLM). To address these challenges, we propose a three-stage visual language model with Process, Retrieve and Filter (VLM-PRF) framework. For interactive retrieval, VLM-PRF uses reinforcement learning (RL) to guide the model to strategically process information via tool-driven operations. For knowledge filtering, our method trains the VLM to transform the raw retrieved information into into task-specific knowledge. With a dual reward as supervisory signals, VLM-PRF successfully enable model to optimize retrieval strategies and answer generation capabilities simultaneously. Experiments on two datasets demonstrate the effectiveness of our framework.

Supplementary Material: zip

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 7626

Loading