Track: long paper (up to 10 pages)
Keywords: Large Language Model, Multi-Modality, Medical Question Answering, Multimodality and Language Grounding to Vision
TL;DR: We propose R-LLaVA, a Med-VQA model that leverages visual regions of interest via simple annotations to enhance biomedical understanding and outperform SoTA models.
Abstract: Artificial intelligence has made significant strides in medical visual question answering (Med-VQA), yet prevalent studies often interpret images holistically, overlooking visual regions of interest that may contain crucial information and that often align with a doctor's prior knowledge, which can be incorporated with minimal annotations (e.g., bounding boxes). To address this gap, this paper introduces R-LLaVA, designed to enhance biomedical VQA understanding by integrating simple medical annotations as prior knowledge directly into the image space through CLIP. These annotated visual regions of interest are then fed into the LLaVA model during training, enriching the model's understanding of biomedical queries. Experimental evaluation on four standard Med-VQA datasets demonstrates R-LLaVA's superiority over existing state-of-the-art (SoTA) methods. Additionally, a novel multiple-choice medical visual understanding dataset is introduced to verify the model's capability in visual comprehension, confirming the positive impact of focusing on visual regions of interest in advancing biomedical VQA understanding.
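For illustration, here is a minimal sketch of the core idea described in the abstract: a doctor-style bounding-box annotation is drawn directly into the image space, and the annotated image is then encoded with an off-the-shelf CLIP vision encoder. The checkpoint name, file path, box coordinates, and helper functions are hypothetical placeholders rather than the paper's actual pipeline, and the fusion of the resulting features with LLaVA is omitted.

```python
from PIL import Image, ImageDraw
import torch
from transformers import CLIPProcessor, CLIPModel

# Illustrative CLIP checkpoint; the paper's exact vision encoder may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def annotate_region(image: Image.Image, box: tuple) -> Image.Image:
    """Draw a simple bounding-box annotation (the doctor's prior) onto the image."""
    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(box, outline="red", width=4)
    return annotated

def encode_image(image: Image.Image) -> torch.Tensor:
    """Encode the (annotated) image into CLIP visual features."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)  # shape: (1, projection_dim)

# Hypothetical example: mark a region of interest on an X-ray, then encode it.
# Downstream, such features would be passed to the LLaVA model during training.
xray = Image.open("chest_xray.png").convert("RGB")
roi_features = encode_image(annotate_region(xray, (120, 80, 260, 220)))
```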
Submission Number: 9