Multi-modal multi-head self-attention for medical VQA

Published: 01 Jan 2024, Last Modified: 09 Oct 2024 · Multimedia Tools and Applications 2024 · CC BY-SA 4.0
Abstract: Medical Visual Question Answering (MedVQA) systems answer questions based on radiology images. Medical images are more complex than general-domain images: they have low contrast and are very similar to one another, so the differences between them can only be recognized by medical practitioners, whereas general images are of high quality and their differences are easily spotted by anyone. Therefore, methods used for general-domain Visual Question Answering (VQA) systems cannot be applied directly. The performance of MedVQA systems depends mainly on the method used to combine the features of the two input modalities: the medical image and the question. In this work, we propose an architecturally simple fusion strategy that uses multi-head self-attention to combine the medical images and questions of the VQA-Med dataset from the ImageCLEF 2019 challenge. The model captures long-range dependencies between the input modalities using the attention mechanism of the Transformer. We show experimentally that increasing the length of the embeddings used in the Transformer improves the representational power of the model. We achieve an overall accuracy of 60.0%, an improvement of 1.35% over the existing model. We also perform an ablation study to elucidate the importance of each model component.
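
A minimal sketch of the fusion idea described in the abstract: image-region embeddings and question-token embeddings are concatenated into a single sequence and passed through multi-head self-attention, so tokens from both modalities can attend to each other. The class name, embedding size, head count, and answer-vocabulary size below are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    """Hypothetical multi-head self-attention fusion of image and question features."""

    def __init__(self, embed_dim=768, num_heads=8, num_answers=1000):
        super().__init__()
        # One Transformer encoder layer supplies the multi-head self-attention
        self.encoder = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.classifier = nn.Linear(embed_dim, num_answers)

    def forward(self, image_feats, question_feats):
        # image_feats:    (batch, n_regions, embed_dim) from an image encoder
        # question_feats: (batch, n_tokens,  embed_dim) from a text encoder
        fused = torch.cat([image_feats, question_feats], dim=1)
        fused = self.encoder(fused)     # joint self-attention across both modalities
        pooled = fused.mean(dim=1)      # simple mean pooling over the fused sequence
        return self.classifier(pooled)  # logits over the answer vocabulary


# Example usage with random features standing in for encoder outputs
model = MultiModalFusion()
img = torch.randn(2, 49, 768)   # e.g. a 7x7 CNN feature map flattened to 49 regions
qst = torch.randn(2, 20, 768)   # e.g. 20 question-token embeddings
logits = model(img, qst)        # shape: (2, num_answers)
```

Treating the answer prediction as classification over a fixed answer vocabulary is a common MedVQA setup; the paper's actual pooling and classification head may differ.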