CRA-Net: Composed Relation Attention Network for Visual Question Answering

11 Nov 2022, OpenReview Archive Direct Upload
Abstract: The task of Visual Question Answering (VQA) is to answer a natural language question tied to the content of a visual image. Most existing VQA models either apply an attention mechanism to locate the relevant object regions and/or utilize off-the-shelf relation-reasoning methods to detect object relations. However, they 1) mostly encode simple relations, which cannot provide sufficiently sophisticated knowledge for answering complicated visual questions; and 2) seldom exploit the harmonious cooperation of object appearance features and relation features. To address these problems, we propose a novel end-to-end VQA model, termed Composed Relation Attention Network (CRA-Net). Specifically, we devise two question-adaptive relation attention modules that can extract not only fine-grained, precise binary relations but also more sophisticated ternary relations. Both kinds of question-related relations can reveal deeper semantics, thereby enhancing reasoning ability in question answering. Furthermore, our CRA-Net fuses object appearance features with relation features under the guidance of the corresponding question, which reconciles the two types of features effectively. Extensive experiments on two large benchmark datasets, VQA-1.0 and VQA-2.0, demonstrate that our proposed model outperforms state-of-the-art approaches.