MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

ACL ARR 2026 January Submission 9489 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Medical Benchmark, Temporal Reasoning, Vision Language Model
Abstract: Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce **MI-CXR**, a benchmark for standardized evaluation of **M**ulti-**I**nterval longitudinal reasoning over multi-visit **CXR** sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: *Temporal Event Localization*, *Interval-wise Change Reasoning*, and *Global Trajectory Summarization*, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision–language models (VLMs) shows low overall performance (29.3% accuracy), only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at: <https://github.com/anonymousetrap12/MI-CXR>
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: medical question answering, clinical and biomedical language models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 9489