An Interpretable Multimodal Visual Question Answering System using Attention-based Weighted Contextual Features
Abstract: Visual question answering (VQA) is a challenging task that requires a deep understanding of both language and images. Most current VQA algorithms model the correlation between basic question embeddings and image features by taking an element-wise product or bilinear pooling of the two vectors. Some algorithms also use attention models to extract features. In this extended abstract, we propose a novel interpretable multimodal system for VQA that uses attention-based weighted contextual features (MA-WCF). The system assigns adaptive weights, based on importance, to the question and image features themselves and to their contextual features. Our model yields state-of-the-art results on the MS COCO VQA datasets for open-ended question tasks.
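To make the contrast in the abstract concrete, the sketch below compares plain element-wise fusion with an attention-style weighted sum over several feature vectors. It is a minimal illustration under stated assumptions, not the MA-WCF architecture, which the abstract does not specify: the function names, the four feature names, the single `scorer` vector, and the random inputs are all hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def elementwise_fusion(q, v):
    """Baseline fusion: element-wise product of question and image vectors."""
    return q * v


def weighted_fusion(features, scorer):
    """Attention-style fusion (illustrative assumption, not the paper's model).

    features: dict mapping a feature name to a vector of dimension d.
    scorer:   vector of dimension d used to score each feature's importance.
    Returns the fused vector and the per-feature weights, which can be
    inspected to see which inputs dominated the result.
    """
    names = list(features)
    stacked = np.stack([features[n] for n in names])  # shape (k, d)
    scores = stacked @ scorer                         # shape (k,)
    weights = softmax(scores)                         # non-negative, sums to 1
    fused = weights @ stacked                         # shape (d,)
    return fused, dict(zip(names, weights))


# Hypothetical inputs: random vectors stand in for learned embeddings.
rng = np.random.default_rng(0)
d = 8
feats = {
    name: rng.standard_normal(d)
    for name in ("question", "image", "question_context", "image_context")
}
scorer = rng.standard_normal(d)

fused, weights = weighted_fusion(feats, scorer)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Because the weights are non-negative and sum to one, they can be read directly as the relative importance of each feature, which is one simple sense in which such a system is interpretable.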