Development of a large-scale medical visual question-answering dataset

Published: 20 Dec 2024, Last Modified: 06 Mar 2025 · Communications Medicine · CC BY-NC-ND 4.0
Abstract: Medical Visual Question Answering (MedVQA) enhances diagnostic accuracy and healthcare delivery by leveraging artificial intelligence to interpret medical images. This study aims to redefine MedVQA as a generation task that mirrors human–machine interaction and to develop a model capable of integrating complex visual and textual information. We constructed a large-scale medical visual question-answering dataset, PMC-VQA, containing 227,000 VQA pairs across 149,000 images that span various modalities and diseases. We introduced a generative model that aligns visual information from a pre-trained vision encoder with a large language model. This model was initially trained on PMC-VQA and subsequently fine-tuned on multiple public benchmarks. Here, we show that our model significantly outperforms existing MedVQA models in generating relevant, accurate free-form answers. We also propose a manually verified test set that presents a greater challenge and serves as a robust measure to monitor the advancement of generative MedVQA methods. The PMC-VQA dataset proves to be an essential resource for the research community, and our model marks a significant breakthrough in MedVQA. We maintain a leaderboard to facilitate comprehensive evaluation and comparison, providing a centralized resource for benchmarking state-of-the-art approaches.
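The abstract describes aligning features from a pre-trained vision encoder with a large language model so that answers can be generated as free-form text. The sketch below illustrates one common way such alignment is done: a small projection module maps vision-encoder patch features into the LLM's embedding space so they can be concatenated with the embedded question tokens. All module names and dimensions here are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch, assuming a frozen vision encoder producing 1024-d patch features
# and an LLM with a 4096-d hidden size; names and sizes are hypothetical.
import torch
import torch.nn as nn


class VisionToLLMProjector(nn.Module):
    """Projects vision-encoder patch features into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)


if __name__ == "__main__":
    projector = VisionToLLMProjector()
    dummy_patches = torch.randn(2, 256, 1024)    # stand-in for frozen vision-encoder output
    visual_tokens = projector(dummy_patches)     # now dimensionally compatible with the LLM
    text_embeddings = torch.randn(2, 32, 4096)   # stand-in for embedded question tokens
    llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
    print(llm_input.shape)                       # torch.Size([2, 288, 4096])
```

In this kind of setup, the concatenated sequence is fed to the language model, which is then trained (and later fine-tuned on public benchmarks) to generate the answer text autoregressively.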