Keywords: medical imaging, MLM, LLM
TL;DR: Multimodal large language models (both generalists and specialists) fail at very simple medical visual tasks.
Abstract: Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical use. We introduce Medblink, a benchmark designed to probe these models for such perceptual abilities. Medblink spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 20 state-of-the-art MLMs, including general-purpose (GPT-5, Gemini 1.5 Pro, Claude 3.5 Sonnet) and domain-specific (Med-Flamingo, LLaVA-Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 76.3%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption.
Primary Area: datasets and benchmarks
Submission Number: 13328