Context-Guided Medical Visual Question Answering

Published: 16 Jul 2024, Last Modified: 16 Jul 2024
Venue: MICCAI Student Board EMERGE Workshop 2024 Oral
License: CC BY 4.0
Keywords: Medical Visual Question Answering, VQA, Medical Image Interpretation, Radiology
TL;DR: This work proposes a novel approach that enhances medical VQA models with textual context from the medical reports associated with the images, using a transformer-based alignment module.
Abstract: Given a medical image and a question in natural language, medical VQA systems are required to predict clinically relevant answers. Integrating information from visual and textual modalities requires complex fusion techniques due to the semantic gap between images and text, as well as the diversity of medical question types. To address this challenge, we propose aligning image and text features in VQA models by using text from medical reports to provide additional context during training. Specifically, we introduce a transformer-based alignment module that learns to align the image with the textual context, thereby incorporating supplementary medical features that can enhance the VQA model’s predictive capabilities. During the inference stage, the VQA model operates robustly without requiring any medical report. Our experiments on the Rad-Restruct dataset demonstrate the significant impact of the proposed strategy and show promising improvements, positioning our approach as competitive with state-of-the-art methods for this task.
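To illustrate the kind of alignment module the abstract describes, the sketch below shows one plausible realization: image tokens cross-attend to encoded report text during training, while the report branch is simply skipped at inference. This is a minimal illustration under our own assumptions; the class name `ContextAlignment`, the dimensions, and the residual fusion are hypothetical and not taken from the paper.

```python
# Hedged sketch of a transformer-based image-to-report alignment module.
# All names and hyperparameters here are illustrative assumptions, not the
# authors' implementation.
from typing import Optional

import torch
import torch.nn as nn


class ContextAlignment(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # Cross-attention: image tokens query the report-text tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(
        self,
        img_tokens: torch.Tensor,                    # (B, N_img, d_model) visual features
        report_tokens: Optional[torch.Tensor] = None  # (B, N_txt, d_model) report features
    ) -> torch.Tensor:
        if report_tokens is None:
            # Inference: no medical report is required; pass features through.
            return img_tokens
        # Training: enrich image tokens with report context via cross-attention.
        ctx, _ = self.cross_attn(query=img_tokens, key=report_tokens, value=report_tokens)
        return self.norm(img_tokens + ctx)  # residual fusion of contextual features


# Usage (shapes assumed for illustration):
# align = ContextAlignment()
# fused = align(img_feats, report_feats)  # training, with report context
# fused = align(img_feats)                # inference, image only
```

The report-free inference path mirrors the abstract's claim that the model operates robustly without a medical report at test time; how the training objective encourages this (e.g., via an alignment loss) is not specified here.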
Submission Number: 5