Keywords: Medical Reasoning Benchmark, Vision-Language Models, Retrieval-Augmented Generation, Multi-Modal Retrieval Augmentation
TL;DR: The first benchmark for LVLMs' temporal medical image reasoning via cross-visit change tracking, showing broad limitations in change analysis and highlighting that multi-modal retrieval augmentation outperforms textual retrieval in this setting.
Abstract: Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a *single* visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's history and track changes over time to provide a comprehensive assessment. In this paper, we introduce TemMed-Bench, a multi-task benchmark for analyzing changes in a patient's condition between clinical visits, which challenges large vision-language models (LVLMs) to reason over **tem**poral **med**ical images. TemMed-Bench consists of a test set covering three tasks – visual question answering (VQA), report generation, and image-pair selection – and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we evaluate twelve LVLMs, comprising six proprietary and six open-source models. Our results show that most LVLMs lack the ability to analyze patients' condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini, and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they still fall short of the desired level. To enhance the tracking of condition changes, we explore augmenting the input with retrieved content from both visual and textual modalities in the medical domain. We show that multi-modal retrieval augmentation yields notably higher performance than both no retrieval and textual retrieval alone across most models on our benchmark, with an average improvement of 2.59% on the VQA task. Overall, we contribute a benchmark grounded in real-world clinical practice that reveals LVLMs' limitations in temporal medical image reasoning and highlights multi-modal retrieval augmentation as a promising direction for addressing this challenge.
Primary Area: datasets and benchmarks
Submission Number: 8366