Abstract: Visual question answering (VQA) based on feature fusion between image vision and question text has recently attracted considerable research interest. Attention mechanisms and dense iterative operations are adopted for fine-grained interplay and matching by aggregating the similarities of image-region and question-word pairs. However, the autocorrelation information of image regions is ignored, which leads to deviations in overall semantic understanding and thereby reduces the accuracy of answer prediction. Moreover, we notice that some valuable but unattended edge information of the image is often completely forgotten after multiple bilateral co-attention operations. In this paper, a novel Compound-Attention Network with Original Feature Injection is proposed to leverage both bilateral information and autocorrelation in a holistic deep framework. A visual feature enhancement mechanism is designed to mine more complete visual semantics and avoid understanding deviation. Then, an original feature injection module is proposed to retain the unattended edge information of the image. Extensive experiments conducted on the VQA 2.0 dataset demonstrate the effectiveness of the proposed method.
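The abstract names two mechanisms without implementation detail: an autocorrelation (self-attention) pass over image-region features, and an injection step that fuses the original, pre-attention features back in so unattended information is not lost. The sketch below is only one possible PyTorch interpretation of those two ideas under our own assumptions; the module names (`VisualSelfAttention`, `OriginalFeatureInjection`), the gated fusion, and all hyperparameters are ours, not the authors'.

```python
# Illustrative sketch, not the authors' released code.
# (1) self-attention over image regions to model their autocorrelation;
# (2) a gated re-injection of the original region features after attention.
import torch
import torch.nn as nn


class VisualSelfAttention(nn.Module):
    """Models autocorrelation among image regions via multi-head self-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, dim)
        attended, _ = self.attn(regions, regions, regions)
        return self.norm(regions + attended)  # residual keeps the input semantics


class OriginalFeatureInjection(nn.Module):
    """Fuses attended features with the original (pre-attention) region features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, attended: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
        # A learned gate decides, per region and channel, how much of the
        # original feature to inject back into the attended representation.
        g = self.gate(torch.cat([attended, original], dim=-1))
        return g * attended + (1.0 - g) * original


if __name__ == "__main__":
    batch, num_regions, dim = 2, 36, 512
    regions = torch.randn(batch, num_regions, dim)

    enhance = VisualSelfAttention(dim)
    inject = OriginalFeatureInjection(dim)

    attended = enhance(regions)        # autocorrelation-enhanced visual features
    fused = inject(attended, regions)  # original features injected back
    print(fused.shape)                 # torch.Size([2, 36, 512])
```

In this reading, the gate lets the model keep information from regions that the attention stack would otherwise down-weight, which is one way to realize the "retain unattended edge information" goal described above.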