MAMF: A Multi-Level Attention-Based Multimodal Fusion Model for Medical Visual Question Answering

Published: 01 Jan 2023 · Last Modified: 18 Jun 2024 · NCAA (2) 2023 · CC BY-SA 4.0
Abstract: Medical Visual Question Answering (VQA) aims to accurately answer clinical questions about medical images. Existing medical VQA models show great potential, but most of them ignore word-level fine-grained features, which help filter out irrelevant regions in medical images more precisely. We present a Multi-level Attention-based Multimodal Fusion model, MAMF, which learns a multi-level multimodal semantic representation for medical VQA. First, we develop a Word-to-Image attention and a Sentence-to-Image attention to capture the correlations of the word embeddings and the sentence-level question feature with the image feature. In addition, we propose an attention alignment loss that adjusts the image-region weights obtained from the word embeddings and the question feature, emphasizing relevant regions to improve the quality of predicted answers. Results on the VQA-RAD and PathVQA datasets show that MAMF significantly outperforms related state-of-the-art baselines.
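
The abstract describes two attention levels (Word-to-Image and Sentence-to-Image) plus an alignment loss between them. The following is a minimal PyTorch sketch of that idea only; the module names, dimensionalities, scoring functions, and the choice of KL divergence for the alignment loss are illustrative assumptions, not the authors' published implementation.

```python
# Illustrative sketch of multi-level attention with an attention alignment loss.
# All design details here are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelAttention(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Project image region features into a joint space.
        self.img_proj = nn.Linear(dim, dim)
        # Separate query projections for word-level and sentence-level inputs.
        self.word_proj = nn.Linear(dim, dim)
        self.sent_proj = nn.Linear(dim, dim)

    def forward(self, img_regions, word_embs, sent_feat):
        # img_regions: (B, R, D) image region features
        # word_embs:   (B, T, D) word-level question embeddings
        # sent_feat:   (B, D)    sentence-level question feature
        img = self.img_proj(img_regions)                            # (B, R, D)
        scale = img.size(-1) ** 0.5

        # Word-to-Image attention: each word attends over image regions.
        w_q = self.word_proj(word_embs)                             # (B, T, D)
        w_scores = torch.bmm(w_q, img.transpose(1, 2))              # (B, T, R)
        w_attn = F.softmax(w_scores / scale, dim=-1)
        # Aggregate per-word maps into one word-level region distribution.
        word_region_attn = w_attn.mean(dim=1)                       # (B, R)

        # Sentence-to-Image attention: the whole question attends over regions.
        s_q = self.sent_proj(sent_feat).unsqueeze(1)                # (B, 1, D)
        s_scores = torch.bmm(s_q, img.transpose(1, 2)).squeeze(1)   # (B, R)
        sent_region_attn = F.softmax(s_scores / scale, dim=-1)

        # Attention alignment loss (assumed here: KL divergence) encouraging
        # the two attention distributions to emphasize the same regions.
        align_loss = F.kl_div(word_region_attn.clamp_min(1e-8).log(),
                              sent_region_attn, reduction="batchmean")

        # Fuse: weight regions by both attention maps and pool.
        weights = (word_region_attn + sent_region_attn).unsqueeze(-1)
        fused = (weights * img_regions).sum(dim=1)                  # (B, D)
        return fused, align_loss
```

In a full model, the fused representation would feed an answer classifier and `align_loss` would be added to the answer-prediction loss with some weighting; those details are not specified in the abstract.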