Keywords: Medical Multimodal Benchmark, Body Lesion Images, Medical VQA
Abstract: Body-surface health conditions, spanning diverse clinical departments, represent some of
the most frequent diagnostic scenarios and a primary target for medical multimodal large
language models (MLLMs). Yet existing medical benchmarks are either built from publicly
available sources with limited expert curation or focus narrowly on disease classification,
failing to reflect the stepwise recognition and reasoning processes physicians follow in
real practice. To address this gap, we introduce MedLesionVQA, the first large-scale benchmark
explicitly designed to evaluate MLLMs on the visual diagnostic workflow for body-surface
conditions. All questions are derived from authentic clinical visual diagnosis
scenarios and verified by medical experts with over 20 years of experience, while the
data are drawn from 10K+ real patient visits, ensuring authenticity, clinical realism, and
diversity. MedLesionVQA consists of 12K in-house images (never publicly released) and
19K expert-verified question–answer pairs, with fine-grained annotations of 94 lesion types,
110 body regions, and 96 diseases. We evaluate 20+ state-of-the-art MLLMs against
human physicians: the best model reaches 56.2% accuracy, far below primary physicians
(61.4%) and senior specialists (73.2%). These results expose the persistent gap between
MLLMs and clinical expertise, underscoring the need for multimodal benchmarks to
drive trustworthy medical AI. The dataset is available at https://github.com/bytedance/MedLesionVQA.
Primary Area: datasets and benchmarks
Submission Number: 15645