Abstract: Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show that they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma 4B-IT across the in-distribution MIMIC-CXR and out-of-distribution PadChest chest X-ray datasets, with cross-architecture validation on LLaVA-RAD 7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD; p < 10^-4), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC-to-PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-RAD's ensemble does not collapse (69.1% accuracy). MC Dropout achieves the best calibration (4.3% ECE) and selective prediction coverage (21.5% at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs. 0.657) and paraphrase screening. Simple methods win.
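The core signal in the abstract, predictive entropy from a single forward pass used as a threshold to flag unreliable predictions, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the probability vectors and the threshold value are hypothetical stand-ins for a model's softmax over answer options.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution, e.g. the
    model's softmax over answer options from a single forward pass."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical softmax outputs for a binary (yes/no) chest X-ray question.
confident = [0.97, 0.03]   # far from the decision boundary
borderline = [0.55, 0.45]  # near the boundary: more likely to flip under rephrasing

# A single entropy threshold flags both unreliable and rephrase-sensitive
# predictions. 0.5 nats is an illustrative value; in practice the threshold
# would be tuned on a validation set.
THRESHOLD = 0.5

for name, p in [("confident", confident), ("borderline", borderline)]:
    h = predictive_entropy(p)
    print(f"{name}: entropy={h:.3f} nats, flagged={h > THRESHOLD}")
```

The borderline prediction's entropy (about 0.69 nats, near the two-class maximum of ln 2) exceeds the threshold, while the confident one (about 0.13 nats) does not, mirroring the paper's claim that boundary proximity drives both miscalibration and rephrase sensitivity.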