Keywords: vision-language model, medical imaging
TL;DR: A dataset and benchmark for quantitative medical image analysis, covering detection of anatomical structures and abnormalities, tumor/lesion size estimation, and angle/distance measurement.
Abstract: Current vision-language models (VLMs) in medicine are primarily designed
for categorical question answering (e.g., “Is this normal or abnormal?”)
or qualitative tasks (e.g., “Describe the image”). However, clinical
decision-making often relies on quantitative assessments, such as measuring
the size of a tumor or the angle of a joint, from which physicians draw
their own diagnostic conclusions. This quantitative reasoning capability remains
underexplored and poorly supported in existing VLMs. In this work, we introduce
MedVision, a large-scale dataset and benchmark specifically designed to
evaluate and improve VLMs on quantitative medical image analysis. MedVision
spans 22 public datasets covering diverse anatomies and modalities, with
30.8 million image-annotation pairs. We focus on three
representative quantitative tasks: (1) detection of anatomical structures and
abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance
(A/D) measurement. Our benchmark results show that current off-the-shelf VLMs
perform poorly on these tasks. With supervised fine-tuning on MedVision,
however, we significantly improve their performance across detection, T/L size
estimation, and A/D measurement, reducing error rates and improving
precision. This work provides a foundation for developing VLMs with
robust quantitative reasoning capabilities in medical imaging.
Primary Area: datasets and benchmarks
Submission Number: 19596