Beyond the Scanner: A Benchmark for Medical Photograph Understanding

01 Dec 2025 (modified: 15 Dec 2025) · MIDL 2026 Conference Submission · CC BY 4.0
Keywords: Benchmark, evaluation, natural images, medical images, visual question-answering
TL;DR: Benchmarking model performance on natural images with medical content.
Abstract: Everyday medical photographs, i.e., images of people or body parts captured with ordinary cameras, are widely accessible to patients yet neglected in medical AI benchmarks. To address this gap, we introduce MedPhoto, a dataset of 984 expert-verified multiple-choice questions spanning seven topics, including Eyes, Trunk & Extremities, and Head & Neck, that require both recognition of fine-grained visual details and complex medical reasoning. We evaluate three vision-language models (VLMs) in a multiple-choice setting and find that Gemini-3 and GPT-5 achieve 78% and 68% accuracy, respectively, while MedGemma reaches only 39%. MedPhoto exposes significant gaps in current VLMs' ability to interpret everyday medical photographs, highlighting the need for models that reason more robustly about the medical content of natural images.
Primary Subject Area: Foundation Models
Secondary Subject Area: Application: Other
Registration Requirement: Yes
Visa & Travel: No
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 218