Abstract: Visual Question Answering (VQA) is one of the most complex multimodal problems studied today: a machine must understand a natural-language question about a reference visual input and then produce the answer to that question. Reliable attention information is crucial for increasing the probability of producing the correct answer. However, existing methods use only implicitly trained attention models, which often fail to attend to the image region the question refers to, limiting their ability to answer correctly. To address this issue, we propose an explicitly trained attention model inspired by the pictorial superiority effect. The model uses attention-oriented word embeddings that increase the efficiency of learning common representation spaces. We use the Visual7W dataset, which is the only dataset that provides visual attention ground-truth information. We demonstrate the effectiveness of the proposed method over both implicit attention models and other state-of-the-art VQA techniques.