A Retriever-Reader Framework with Visual Entity Linking for Knowledge-Based Visual Question Answering
Abstract: In this paper, we propose a Retriever-Reader framework with Visual Entity Linking (RR-VEL) for knowledge-based visual question answering. Given images and original questions, the visual entity linking (VEL) module extracts key entities in images to replace the question referents for semantic disambiguation, achieving entity-oriented queries with explicit entities. Furthermore, the Retriever encodes the queries and knowledge items by Bert with a feed-forward layer, and obtains a set of knowledge candidates. The Reader encodes the questions with image captions and knowledge candidates in two branches, which avoids their interference during self-attentive encoding. Finally, the decoder of Reader fuses the encoded features to generate answers. Extensive experiments conducted on the two public datasets show that our method significantly outperforms the existing baselines.
Loading