Visual Entity-Centric Prompting for Knowledge Retrieval in Knowledge-based VQA

Published: 01 Jan 2025, Last Modified: 05 Nov 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: External knowledge provides critical clues for knowledge-based visual question answering (KB-VQA), yet the implicit knowledge in images is difficult to capture when constructing effective queries against knowledge bases. To this end, we propose visual entity-centric prompting for knowledge retrieval (VEPR), which bridges the gap between implicit and explicit knowledge by centering retrieval on visual entities via large language models. Specifically, a visual entity question answering (EQ) module localizes the critical entities in the given image and question and generates entity-centric questions with a large language model; EQ then obtains several entity-centric question-answer pairs by answering these questions with a visual language model. Furthermore, a question-answer-enhanced retrieval (ER) module constructs a query by summarizing the question, the image caption and the question-answer pairs, and uses it to retrieve explicit knowledge items. Finally, a multi-branch reader (MR) module encodes the given question, the visual content and the retrieved knowledge items, and decodes them to make answer predictions. Extensive experiments on two public datasets demonstrate the effectiveness of VEPR.
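To make the three-stage pipeline described above concrete, the following is a minimal sketch of how the EQ, ER and MR modules could be composed. All model interfaces here (e.g. llm.generate_entity_questions, vlm.caption, vlm.answer, llm.summarize, retriever.search, reader.predict) are hypothetical placeholders, not APIs from the paper; concrete models would be injected by the caller.

```python
# Sketch of the VEPR pipeline: EQ -> ER -> MR.
# All method names on llm / vlm / retriever / reader are assumed placeholders.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VQAExample:
    image: object   # the raw image (e.g. a PIL.Image); kept abstract here
    question: str   # the original KB-VQA question


def entity_question_answering(ex: VQAExample, llm, vlm,
                              num_pairs: int = 3) -> Tuple[str, List[Tuple[str, str]]]:
    """EQ module (sketch): caption the image, ask an LLM for entity-centric
    questions, and answer them with a visual language model."""
    caption = vlm.caption(ex.image)
    entity_questions = llm.generate_entity_questions(
        question=ex.question, caption=caption, n=num_pairs)
    qa_pairs = [(q, vlm.answer(ex.image, q)) for q in entity_questions]
    return caption, qa_pairs


def enhanced_retrieval(ex: VQAExample, caption: str,
                       qa_pairs: List[Tuple[str, str]],
                       llm, retriever, top_k: int = 5) -> List[str]:
    """ER module (sketch): summarize the question, caption and QA pairs into a
    single query and retrieve explicit knowledge items from the knowledge base."""
    query = llm.summarize(question=ex.question, caption=caption, qa_pairs=qa_pairs)
    return retriever.search(query, top_k=top_k)


def multi_branch_reader(ex: VQAExample, knowledge_items: List[str], reader) -> str:
    """MR module (sketch): encode the question, visual content and retrieved
    knowledge in separate branches and decode an answer prediction."""
    return reader.predict(image=ex.image, question=ex.question,
                          knowledge=knowledge_items)


def vepr_answer(ex: VQAExample, llm, vlm, retriever, reader) -> str:
    """End-to-end VEPR sketch: entity QA, enhanced retrieval, multi-branch reading."""
    caption, qa_pairs = entity_question_answering(ex, llm, vlm)
    knowledge = enhanced_retrieval(ex, caption, qa_pairs, llm, retriever)
    return multi_branch_reader(ex, knowledge, reader)
```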