Language and Visual Relations Encoding for Visual Question Answering

ICIP 2019
Abstract: Visual Question Answering (VQA) involves complex relations within two modalities, including the relations between words and between image regions. Encoding these relations is therefore important for accurate VQA. In this paper, we propose two modules to encode the two types of relations respectively. The language relation encoding module encodes multi-scale relations between words via a novel masked self-attention. The visual relation encoding module encodes the relations between image regions: it computes the response at a position as a weighted sum of the features at other positions in the feature maps. Extensive experiments demonstrate the effectiveness of each module. Our model achieves state-of-the-art performance on the VQA 1.0 dataset.
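The visual relation encoding described in the abstract (a response at one position computed as a weighted sum of the features at other positions) follows the general pattern of non-local / self-attention operations. Below is a minimal sketch of such a weighted-sum relation module over image-region features; the class name, projection layers, dimensions, and residual connection are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionRelationEncoding(nn.Module):
    """Sketch of a weighted-sum relation module (non-local / self-attention style).

    Computes the response at each region position as a weighted sum of the
    features at all positions. Names and dimensions are assumptions made
    for illustration only.
    """

    def __init__(self, dim: int, proj_dim: int = 512):
        super().__init__()
        self.query = nn.Linear(dim, proj_dim)
        self.key = nn.Linear(dim, proj_dim)
        self.value = nn.Linear(dim, proj_dim)
        self.out = nn.Linear(proj_dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_regions, dim) image-region features
        q, k, v = self.query(feats), self.key(feats), self.value(feats)
        # pairwise relation weights between all region positions
        attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        # response at each position = weighted sum over all positions,
        # added back to the input features (residual is an assumption)
        return feats + self.out(attn @ v)
```

Under these assumptions, the module can be applied directly to a tensor of region features, e.g. `RegionRelationEncoding(dim=2048)(torch.randn(8, 36, 2048))` for 36 detected regions per image.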