Is Unimodal Bias Always Bad for Visual Question Answering? A Medical Domain Study with Dynamic Attention

Zhongtian Sun, Anoushka Harit, Alexandra I. Cristea, Jialin Yu, Noura Al Moubayed, Lei Shi

Published: 01 Jan 2022, Last Modified: 14 Nov 2024. Venue: IEEE Big Data 2022. License: CC BY-SA 4.0

Abstract: Medical visual question answering (Med-VQA) aims to answer medical questions based on the clinical images provided. The field is still in its infancy due to the complexity of the trio of questions, multimodal features, and expert knowledge. In this paper, we tackle a 'myth' in the Natural Language Processing area: that unimodal bias is always undesirable in learning models. Additionally, we study the effect of integrating a novel dynamic attention mechanism into such models, inspired by a recent graph deep learning study. Unlike traditional attention, dynamic attention scores are conditioned on the different query words in a question and thus enhance the model's ability to learn text representations. We propose that some questions are answered more accurately when the question embedding is reinforced after the multimodal features are fused. Extensive experiments on the VQA-RAD dataset demonstrate that our proposed model, reinforCe unimOdal dynamiC Attention (COCA), outperforms state-of-the-art methods overall and performs competitively on open-ended question answering.
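
The abstract gives no formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes a GATv2-style scoring function, in which the nonlinearity is applied before the final projection so that the ranking of key words can change with the query word, and it assumes a simple additive reinforcement of the question embedding after fusion. The function names, shapes, and parameters (`dynamic_attention`, `reinforce_question`, `W`, `a`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_attention(words, W, a):
    """Query-conditioned attention over the words of one question.

    words: (n, d) word embeddings
    W:     (h, 2d) projection applied to each [query || key] pair (assumed)
    a:     (h,) scoring vector applied after the nonlinearity (assumed)
    Returns (n, n) attention weights and (n, d) attended word vectors.
    """
    n, _ = words.shape
    queries = np.repeat(words[:, None, :], n, axis=1)  # (n, n, d), row i holds word i
    keys = np.repeat(words[None, :, :], n, axis=0)     # (n, n, d), column j holds word j
    pairs = np.concatenate([queries, keys], axis=-1)   # (n, n, 2d)
    # Nonlinearity sits between W and a, so scores depend jointly on query and key.
    scores = leaky_relu(pairs @ W.T) @ a               # (n, n)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights, weights @ words


def reinforce_question(fused, question_vec):
    """Assumed additive reinforcement of the question embedding after fusion."""
    return fused + question_vec


rng = np.random.default_rng(0)
words = rng.normal(size=(5, 8))      # 5 words, 8-dimensional embeddings
W = rng.normal(size=(16, 16))        # h = 16, 2d = 16
a = rng.normal(size=16)
weights, attended = dynamic_attention(words, W, a)
question_vec = attended.mean(axis=0)  # pooled question embedding (assumed pooling)
fused = rng.normal(size=8)            # stand-in for fused image-question features
reinforced = reinforce_question(fused, question_vec)
```

In this sketch the random arrays stand in for learned parameters and real features; a trained model would learn `W` and `a`, and the actual fusion and reinforcement operators used by COCA may differ.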