Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models

Published: 26 Apr 2026, Last Modified: 26 Apr 2026 · Med-Reasoner 2026 Poster · License: CC BY 4.0
Keywords: medical VLMs, paraphrase consistency, image grounding, deployment safety, text shortcuts, four-quadrant taxonomy
TL;DR: Models with the lowest flip rates have the highest fraction of predictions that are identical when the image is removed: consistency training makes medical VLMs appear reliable by teaching them to ignore the image.
Abstract: Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a \emph{four-quadrant} per-sample safety taxonomy that jointly evaluates \textbf{consistency} (stable predictions across paraphrased prompts) and \textbf{image reliance} (predictions that change when the image is removed). Samples are classified as \emph{Ideal} (consistent and image-reliant), \emph{Fragile} (inconsistent but image-reliant), \emph{Dangerous} (consistent but not image-reliant), or \emph{Worst} (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5\% flip rate on PadChest while 98.5\% of its samples are Dangerous. Critically, Dangerous samples exhibit \emph{high accuracy} (up to 99.6\%) and \emph{low entropy}, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction ($r=-0.89$, $n{=}10$) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
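The four-quadrant taxonomy reduces to crossing two per-sample booleans: consistency across paraphrases, and image reliance (the text-only prediction differs from the with-image prediction). A minimal sketch of that classification logic, with a hypothetical helper name and the assumption that image reliance is judged against the first paraphrase's prediction:

```python
def classify_sample(paraphrase_preds: list[str], text_only_pred: str) -> str:
    """Assign a sample to one of the four safety quadrants.

    paraphrase_preds: predictions for the same (image, question) pair under
        semantically equivalent paraphrased prompts.
    text_only_pred: prediction from the same prompt with the image removed
        (the single extra forward pass the paper recommends).

    Hypothetical sketch; the paper does not specify which paraphrase the
    text-only prediction is compared against, so we use the first.
    """
    consistent = len(set(paraphrase_preds)) == 1
    image_reliant = paraphrase_preds[0] != text_only_pred

    if consistent and image_reliant:
        return "Ideal"      # stable and actually uses the image
    if not consistent and image_reliant:
        return "Fragile"    # uses the image but flips under paraphrase
    if consistent and not image_reliant:
        return "Dangerous"  # stable because it ignores the image
    return "Worst"          # unstable and ignores the image
```

The "Dangerous" branch is the paper's key case: such samples look reliable under a consistency-only check, and only the text-only comparison exposes them.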
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 23