Medical thinking with multiple images

Zonghai Yao; Benlu Wang; Yifan Zhang; Junda Wang; Iris Xia; Zhipeng Tang; Shuo Han; Feiyun Ouyang; Zhichao Yang; Arman Cohan; Hong Yu

Published: 12 Oct 2025, Last Modified: 12 Nov 2025. Venue: GenAI4Health 2025 Poster. License: CC BY 4.0

Keywords: Multimodal Diagnostic Reasoning, Radiology Benchmark, Medical LLMs

Abstract: Large language models and vision-language models score high on many medical QA benchmarks; however, real-world clinical reasoning remains challenging because cases often involve multiple images and require cross-view fusion. We present MedThinkVQA, a benchmark that asks models to think with multiple images: read each image, merge evidence across views, and pick a diagnosis with stepwise supervision. We make three parts explicit: multi-image questions, expert-annotated stepwise supervision, and beyond-accuracy evaluation. Only MedThinkVQA combines all these parts in one expert-annotated benchmark. The dataset has 8,481 cases in total, with 751 test cases, and on average 6.51 images per case; it is expert-annotated and, at this level, larger and more image-dense than prior work (earlier maxima < 1.43 images per case). On the test set, GPT-5 achieves 57.39% accuracy, approximately 15 percentage points below the strongest result on the most challenging prior benchmark of a similar kind, while other strong models are lower (Qwen2.5-VL-32B: 39.54%, MedGemma-27B: 37.55%, InternVL3.5-38B: 43.14%). Giving expert findings and summaries brings clear gains, but using models' self-generated ones brings small or negative gains. Step-level evaluation shows where models stumble: errors center on image reading and cross-view integration in both decisive and non-decisive steps (>70%); when a step is decisive for the final choice, reasoning slips become more common (32.26%), while scenario and pure-knowledge slips are relatively rare (<10%). These patterns isolate and quantify the core obstacle: extracting and integrating cross-image evidence, rather than language-only inference.

Submission Number: 172
