Efficient Multimodal Selection for Retrieval in Knowledge-Based Visual Question Answering

Published: 2025, Last Modified: 22 Jan 2026 · IEEE Trans. Circuits Syst. Video Technol. 2025 · CC BY-SA 4.0
Abstract: Retrieval plays an important role in knowledge-based visual question answering (KB-VQA), which relies on external knowledge to answer questions about an image. However, not all information in the external knowledge base is beneficial for retrieval; for example, some knowledge is semantically similar to the query but useless for answering the question. To improve the effectiveness and efficiency of retrieval, we propose efficient multimodal selection, which filters out irrelevant information and improves retriever performance for KB-VQA. First, to exclude most irrelevant knowledge from the large external knowledge base, multimodal selection applies a query-aware sample selection method, which uses the pretrained answer generator's predictions to obtain better positive and negative training samples, helping the retriever distinguish knowledge that is genuinely useful for answering from knowledge that is merely semantically similar to the multimodal query. Second, question-aware visual feature selection is proposed to select the distinguishable visual information related to the question: cross-attention between questions and images produces question-aware visual features. These features are then used to perform fine-grained multimodal retrieval within the reduced candidate set to obtain the final top-ranked knowledge. Experimental results show that the proposed approach achieves state-of-the-art retrieval performance on the OK-VQA and FVQA datasets, demonstrating the effectiveness of our selection strategy for retrieval.
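
The query-aware sample selection step can be pictured as follows: each retrieved candidate passage is labeled by whether a frozen, pretrained answer generator produces a correct answer when conditioned on it. The sketch below is a minimal illustration under assumed interfaces (the `generator` callable and exact-match answer checking are assumptions, not the paper's released code).

```python
# A minimal sketch of query-aware sample selection. The generator interface
# and exact-match scoring are illustrative assumptions, not the paper's code.
from typing import Callable, List, Tuple

def select_training_samples(
    query: str,
    answers: List[str],                    # accepted ground-truth answers
    candidates: List[str],                 # semantically retrieved passages
    generator: Callable[[str, str], str],  # pretrained answer generator (assumed)
) -> Tuple[List[str], List[str]]:
    """Label each candidate by whether it lets the frozen generator
    answer correctly, yielding positives and hard negatives for
    contrastive retriever training."""
    answer_set = {a.strip().lower() for a in answers}
    positives, negatives = [], []
    for passage in candidates:
        prediction = generator(query, passage)
        if prediction.strip().lower() in answer_set:
            positives.append(passage)   # useful for answering -> positive
        else:
            negatives.append(passage)   # merely similar, not useful -> hard negative
    return positives, negatives
```

Passages that look relevant but do not lead to a correct answer become hard negatives, which is what pushes the retriever beyond pure semantic similarity.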
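For the question-aware visual feature selection step, one minimal realization is a single cross-attention layer in which question tokens attend over image region features, so that regions relevant to the question dominate the resulting representation. The dimensions, single-layer design, and mean pooling below are assumptions for illustration only.

```python
# A minimal PyTorch sketch of question-aware visual feature selection via
# cross-attention. Layer sizes and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class QuestionAwareVisualSelector(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Queries come from the question; keys/values from image regions.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # question_feats: (B, Lq, dim); visual_feats: (B, Lv, dim)
        attended, _ = self.cross_attn(
            query=question_feats, key=visual_feats, value=visual_feats
        )
        # Question-aware visual features, one per question token.
        return self.norm(attended)

# Usage: pool into a single vector for fine-grained multimodal retrieval.
selector = QuestionAwareVisualSelector(dim=512)
q = torch.randn(2, 12, 512)          # question token features
v = torch.randn(2, 36, 512)          # image region features
fused = selector(q, v).mean(dim=1)   # (2, 512) retrieval embedding
```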