Leveraging Coarse-to-Fine Grained Representations in Contrastive Learning for Differential Medical Visual Question Answering

Published: 01 Jan 2024 · Last Modified: 05 Nov 2025 · MICCAI (5) 2024 · CC BY-SA 4.0
Abstract: Chest X-ray Differential Medical Visual Question Answering (Diff-MedVQA) is a novel multi-modal task designed to answer questions about diseases, especially their differences, based on a main image and a reference image. Compared to the widely explored visual question answering in the general domain, Diff-MedVQA presents two unique issues: (1) variations in medical images are often subtle, and (2) two chest X-rays taken at different times can never be captured from exactly the same view. These issues significantly hinder accurate answering of questions about medical image differences. To address this, we introduce a two-stage framework featuring Coarse-to-Fine Granularity Contrastive Learning. Specifically, our method initially employs an anatomical encoder and a disease classifier to obtain fine-grained visual features of the main and reference images. It then integrates an anatomical knowledge graph to strengthen the relationship between anatomical and disease regions, while Multi-Change Captioning transformers identify the subtle differences between main and reference features. During pre-training, Coarse-to-Fine Granularity Contrastive Learning aligns knowledge-enhanced visual differences with keyword features such as anatomical parts, symptoms, and diseases. During Diff-MedVQA fine-tuning, the model treats the differential features as context-grounded queries, with language modeling guiding answer generation. Extensive experiments on the MIMIC-CXR-Diff dataset validate the effectiveness of our proposed method. Code is available at https://github.com/big-white-rabbit/Coarse-to-Fine-Grained-Contrastive-Learning.
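The abstract does not give the exact form of the contrastive objective, but the described alignment of visual-difference features with keyword (anatomy/symptom/disease) features is commonly implemented as a symmetric InfoNCE loss. The sketch below is a generic illustration under that assumption, with random tensors standing in for the model's actual difference and keyword embeddings; function and variable names are hypothetical, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def info_nce(diff_feats, keyword_feats, temperature=0.07):
    """Symmetric InfoNCE-style alignment loss (illustrative, not the
    paper's exact objective).

    diff_feats, keyword_feats: (batch, dim) tensors; row i of each is a
    positive pair, and all other rows in the batch act as negatives.
    """
    z1 = F.normalize(diff_feats, dim=-1)
    z2 = F.normalize(keyword_feats, dim=-1)
    logits = z1 @ z2.t() / temperature  # (batch, batch) cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Contrast in both directions: visual-to-text and text-to-visual.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: random features in place of the encoders' outputs.
diff = torch.randn(8, 256)
keys = torch.randn(8, 256)
loss = info_nce(diff, keys)
```

In practice such a loss would be computed at each granularity level (coarse region features and fine-grained disease features), with the temperature treated as a tunable hyperparameter.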