Keywords: medical visual question answering; medical VQA benchmarks
Abstract: Medical education videos capture the systematic, multi-image diagnostic reasoning that clinicians employ in practice—examining series of related scans, comparing views, and synthesizing findings across modalities.
To evaluate whether MLLMs can perform this fundamental aspect of clinical reasoning, we introduce MedFrameQA—the first benchmark explicitly designed to test multi-image medical VQA through educationally validated diagnostic sequences.
To build MedFrameQA at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images,
and 2) a multi-stage filtering strategy, combining model-based and manual review, that preserves data clarity, difficulty, and medical relevance. The resulting dataset
comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images.
We comprehensively benchmark 11 advanced Multimodal LLMs---both proprietary and open source, with and without explicit reasoning modules---on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50\%, and that accuracy fluctuates as the number of images per question increases.
Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities.
These findings highlight a critical gap: while MLLMs may handle single-image medical tasks, they fail at the multi-image comparative reasoning that defines real clinical practice.
We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13382