EuroRadQA: A Benchmark for Multimodal Differential Diagnosis in Radiology
Keywords: Multimodal Diagnostic Reasoning, Radiology Benchmark, Medical LLMs
Abstract: Recent advances in medical large language and vision-language models have led to near-expert performance on many simplified question-answering tasks; however, these tasks do not accurately reflect the true complexity of clinical reasoning.
We present EuroRadQA, the first benchmark designed to evaluate clinically authentic diagnostic reasoning over multi-image radiology cases.
Each EuroRadQA case includes multiple radiological images (e.g., different views or modalities) and is accompanied by expert-authored, step-by-step reasoning annotations. These annotations include per-image captions that serve as radiological hints, per-case image findings that provide an integrated imaging summary, and a detailed discussion mirroring the workflow of differential diagnosis.
These structured rationales let evaluation go beyond final-answer accuracy and assess whether models can interpret each image and logically synthesize the evidence as a human radiologist would.
EuroRadQA substantially raises the visual and reasoning complexity relative to prior benchmarks. Earlier medical VQA datasets posed single-image, single-step questions answerable without extensive inference, and recent exam-style benchmarks increase difficulty but remain text-only and lack clinical context. In contrast, EuroRadQA challenges models to perform multi-step multimodal reasoning on real clinical cases.
By providing both rich visual input and expert reasoning traces, EuroRadQA enables rigorous evaluation of a model’s ability to think like a radiologist: identifying findings in each image, forming a differential, and justifying its diagnosis.
We hope this benchmark will drive the development of multimodal models with the genuine reasoning skills needed for high-stakes medical decision-making.
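The case structure described in the abstract (multiple images, per-image captions, per-case image findings, and a discussion) could be represented roughly as follows. This is a minimal sketch only: the class names, field names, and placeholder values are assumptions made for illustration and are not taken from the EuroRadQA release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseImage:
    """One radiological image in a case (hypothetical schema)."""
    path: str     # assumed: location of the image file
    caption: str  # per-image caption, serving as a radiological hint


@dataclass
class RadiologyCase:
    """One multi-image case with its reasoning annotations (hypothetical schema)."""
    case_id: str
    images: List[CaseImage]
    image_findings: str  # per-case integrated imaging summary
    discussion: str      # differential-diagnosis style discussion
    diagnosis: str       # assumed: the final answer a model is scored against

    def is_multi_image(self) -> bool:
        # True when the case has more than one image (views or modalities).
        return len(self.images) > 1

    def hints(self) -> List[str]:
        # Collect the per-image captions in image order.
        return [image.caption for image in self.images]


# Placeholder values only; no real case data is shown here.
example = RadiologyCase(
    case_id="<case id>",
    images=[
        CaseImage(path="<image 1>", caption="<caption for image 1>"),
        CaseImage(path="<image 2>", caption="<caption for image 2>"),
    ],
    image_findings="<integrated imaging summary>",
    discussion="<differential diagnosis discussion>",
    diagnosis="<final diagnosis>",
)
```

Keeping the captions attached to their images, rather than in one flat list, makes it straightforward to check a model's interpretation of each image separately from its final diagnosis.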
Submission Number: 172