Understanding the robustness of vision-language models to medical image artefacts

Zijie Cheng, Ariel Yuhan Ong, Siegfried K. Wagner, David Adrian Merle, Lie Ju, Hanyuan Zhang, Ruinian Chen, Linze Pang, Boxuan Li, Tiantian He, Emma Anran Ran, Hongyang Jiang, Gabriel Dawei Yang, Ke Zou, Jocelyn Hui Lin Goh, Sahana Srinivasan, André Altmann, Daniel C. Alexander, Carol Y. Cheung, Yih Chung Tham et al. (2 additional authors not shown)

Published: 2025 · Last Modified: 02 Mar 2026 · npj Digit. Medicine 2025 · CC BY-SA 4.0
Abstract: Vision-language models (VLMs) show promise for answering clinically relevant questions, but their robustness to medical image artefacts remains unclear. We evaluated VLMs’ robustness through their performance on images with and without weak artefacts across five artefact categories, as well as their ability to detect images with strong artefacts. We built evaluation benchmarks using brain MRI scans, chest X-rays, and retinal images, drawing on four real-world medical datasets. VLMs achieved moderate accuracy on original, unaltered images (0.645, 0.602 and 0.604 for the MRI, OCT, and X-ray applications, respectively). Accuracy declined on images with weak artefacts (−3.34%, −9.06% and −10.46%), while strong artefacts were detected at low rates (0.194, 0.128 and 0.115). Our findings indicate that VLMs are not yet capable of performing tasks on medical images with artefacts, underscoring the need to establish uniform benchmarks that thoroughly examine model robustness to image artefacts, and to explicitly incorporate artefact-aware method design and robustness tests into VLM development.
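The accuracy declines reported above can be read as relative percentage changes between clean and weak-artefact performance. A minimal sketch of that comparison, assuming the declines are relative (not absolute percentage points) and using a hypothetical weak-artefact accuracy for the MRI application:

```python
# Hypothetical helper illustrating the robustness comparison described in the
# abstract; the weak-artefact accuracy value below is illustrative, not from
# the paper.

def relative_decline(clean_acc: float, artefact_acc: float) -> float:
    """Percentage change in accuracy from clean images to artefact images."""
    return (artefact_acc - clean_acc) / clean_acc * 100.0

# Clean-image accuracies reported for the three applications.
clean = {"MRI": 0.645, "OCT": 0.602, "X-ray": 0.604}

# An illustrative weak-artefact accuracy of 0.6235 on MRI would correspond
# to roughly the reported -3.34% decline.
print(f"{relative_decline(clean['MRI'], 0.6235):.2f}")
```

Under this reading, the strong-artefact detection rates (0.194, 0.128, 0.115) are a separate metric: the fraction of heavily degraded images the model flags as such, rather than an accuracy change.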