Region-Specific Retrieval Augmentation for Longitudinal Visual Question Answering: A Mix-and-Match Paradigm
Abstract: Visual Question Answering (VQA) has advanced in recent years, inspiring adaptations to radiology for medical diagnosis. Longitudinal VQA, which requires an understanding of changes in images over time, can further support patient monitoring and treatment decision-making. This work introduces RegioMix, a retrieval-augmented paradigm for longitudinal VQA that generates retrieval objects through a mix-and-match technique, combining different regions drawn from various retrieved images. This process also produces a pseudo-difference description for the retrieved pair by leveraging the available reports associated with each retrieved region. To align such statements with both the posed question and the input image pair, we introduce a Dual Alignment module. Experiments on the MIMIC-Diff-VQA X-ray dataset demonstrate our method’s superiority, outperforming the state-of-the-art by 77.7 points in CIDEr and \(8.3\%\) in BLEU-4, while relying solely on the training dataset for retrieval, showcasing the effectiveness of our approach. Code is available at https://github.com/KawaiYung/RegioMix.
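The abstract describes region-level retrieval in which each anatomical region of the query is retrieved independently, so the assembled retrieval object can mix regions from different training images, and the associated report sentences form a pseudo-description. Below is a minimal, hypothetical sketch of that mix-and-match idea; the datastore layout, embedding dimension, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical datastore: one entry per anatomical region of every training image,
# holding a region embedding and the report sentence describing that region.
rng = np.random.default_rng(0)
datastore = {
    "left lung":  {"embeddings": rng.random((100, 256)), "sentences": [f"left lung finding {i}" for i in range(100)]},
    "right lung": {"embeddings": rng.random((100, 256)), "sentences": [f"right lung finding {i}" for i in range(100)]},
    "heart":      {"embeddings": rng.random((100, 256)), "sentences": [f"cardiac finding {i}" for i in range(100)]},
}

def retrieve_region(region_name: str, query_emb: np.ndarray) -> str:
    """Return the report sentence of the most similar training region (cosine similarity)."""
    entries = datastore[region_name]
    embs = entries["embeddings"]
    sims = embs @ query_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return entries["sentences"][int(np.argmax(sims))]

def mix_and_match(query_regions: dict) -> str:
    """Retrieve each region independently, so the assembled retrieval object can
    combine regions originating from different training images."""
    retrieved = [retrieve_region(name, emb) for name, emb in query_regions.items()]
    return " ".join(retrieved)

# Usage: a longitudinal pair, each image represented by per-region embeddings;
# the two pseudo-descriptions together stand in for a pseudo-difference description.
main_query = {name: rng.random(256) for name in datastore}
ref_query = {name: rng.random(256) for name in datastore}
pseudo_difference = (mix_and_match(main_query), mix_and_match(ref_query))
print(pseudo_difference)
```

In the paper, such pseudo-difference statements are then aligned with the question and the input image pair by the Dual Alignment module; this sketch only illustrates the retrieval side.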