Keywords: Medical Visual Question Answering, VQA, Medical Image Interpretation, Radiology
TL;DR: This work enhances medical VQA models by aligning image features with textual context from the reports associated with medical images, using a transformer-based alignment module.
Abstract: Given a medical image and a question in natural language, medical VQA systems are required to predict clinically relevant answers. Integrating information from the visual and textual modalities requires complex fusion techniques due to the semantic gap between images and text, as well as the diversity of medical question types. To address this challenge, we propose aligning image and text features in VQA models by using text from medical reports to provide additional context during training. Specifically, we introduce a transformer-based alignment module that learns to align the image with the textual context, thereby incorporating supplementary medical features that can enhance the VQA model’s predictive capabilities. At inference time, the VQA model operates robustly without requiring any medical report. Our experiments on the Rad-Restruct dataset demonstrate the significant impact of the proposed strategy and show promising improvements, positioning our approach as competitive with state-of-the-art methods on this task.
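The abstract does not specify the alignment module's internal design, so the following is only a minimal sketch of one plausible realization: image patch features cross-attend to report-token features through a standard PyTorch transformer decoder during training, and pass through unchanged at inference when no report is available. The class name, dimensions, and the pass-through fallback are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ReportAlignmentModule(nn.Module):
    """Illustrative transformer-based alignment module (hypothetical).

    Image features attend to report-text features via cross-attention
    so that report context can shape the visual representation during
    training; at inference, no report is needed.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        # A TransformerDecoder provides self-attention over the image
        # tokens plus cross-attention into the report tokens ("memory").
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, image_feats, report_feats=None):
        # image_feats:  (B, N_img, dim) patch/region embeddings
        # report_feats: (B, N_txt, dim) report token embeddings, or None
        if report_feats is None:
            # Assumed inference-time behavior: without a medical report,
            # the image features pass through unchanged.
            return image_feats
        return self.decoder(tgt=image_feats, memory=report_feats)

if __name__ == "__main__":
    module = ReportAlignmentModule()
    img = torch.randn(2, 49, 768)   # e.g. a 7x7 grid of image patches
    txt = torch.randn(2, 32, 768)   # encoded report tokens
    aligned = module(img, txt)      # training: report context available
    no_report = module(img)         # inference: report-free operation
    print(aligned.shape, no_report.shape)  # both (2, 49, 768)
```

One appeal of this shape is that the report branch is strictly optional: the module degrades gracefully to an identity mapping, which matches the paper's claim that inference requires no medical report.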
Camera Ready Submission: zip
Submission Number: 5