Keywords: medical visual question answering; medical VQA benchmarks
Abstract: Medical education videos capture the systematic, multi-image diagnostic reasoning that clinicians employ in practice—examining series of related scans, comparing views, and synthesizing findings across modalities.
To evaluate whether MLLMs can perform this fundamental aspect of clinical reasoning, we introduce MedFrameQA—the first benchmark explicitly designed to test multi-image medical VQA through educationally validated diagnostic sequences.
To build MedFrameQA at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images,
and 2) a multi-stage filtering strategy, combining model-based and manual review, that preserves data clarity, difficulty, and medical relevance. The resulting dataset
comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images.
We comprehensively benchmark 11 advanced Multimodal LLMs---both proprietary and open source, with and without explicit reasoning modules---on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50\%, and that accuracy fluctuates as the number of images per question increases.
Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities.
These findings highlight a critical gap: while MLLMs may handle single-image medical tasks, they fail at the multi-image comparative reasoning that defines real clinical practice.
We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13382