When the Right Answer Is Missing: Probing Hallucination in Medical Reasoning Models via NOTA Evaluation
Keywords: medical visual question answering, hallucination probing, NOTA evaluation, uncertainty calibration, chest X-ray VQA, multiturn prompting, overconfidence in medical AI, reasoning model robustness, clinical AI safety, medical multimodal reasoning
TL;DR: Medical reasoning VLMs collapse under NOTA probing (r=0.113 correlation with base accuracy); medically fine-tuned models drop up to 41 pp; multiturn prompting recovers up to +36 pp with no retraining needed.
Abstract: Standard benchmarks evaluate medical vision-language models by asking "How often is the model correct?" but rarely ask "How does the model behave when it cannot be correct?" We address this gap by evaluating 19 models on the ReXVQA chest X-ray benchmark using NOTA (None of the Above) and No Answer adversarial variants across 13 clinical task and category combinations. Base accuracy is uncorrelated with NOTA resilience (r=0.079), and medically fine-tuned models show the largest hallucination collapses, exceeding 40 percentage points. We propose two retraining-free mitigations: multiturn prompting, which recovers up to 36 pp on the No Answer variant for non-thinking models, and prompt optimization, which recovers up to 27 pp on the No Answer variant for tokenwise reasoning models, revealing complementary mechanisms tied to model architecture.
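As a rough illustration of the evaluation protocol summarized above, the sketch below shows one plausible way to derive NOTA and No Answer variants from a multiple-choice VQA item. The field names (`options`, `answer_idx`) and the exact construction are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch (assumed data format, not the authors' code): build NOTA and
# No Answer adversarial variants from a multiple-choice VQA item.

import random

def make_nota_variant(item, seed=0):
    """Replace the correct option with 'None of the above', so the
    originally correct answer is no longer among the choices and
    NOTA becomes the intended response."""
    rng = random.Random(seed)
    options = list(item["options"])
    options[item["answer_idx"]] = "None of the above"
    rng.shuffle(options)
    return {**item,
            "options": options,
            "answer_idx": options.index("None of the above")}

def make_no_answer_variant(item):
    """Drop the correct option entirely; no remaining choice is correct,
    so a well-calibrated model should abstain rather than pick one."""
    options = [o for i, o in enumerate(item["options"])
               if i != item["answer_idx"]]
    return {**item, "options": options, "answer_idx": None}

if __name__ == "__main__":
    # Hypothetical ReXVQA-style item for demonstration purposes only.
    example = {"question": "Which finding is present on this chest X-ray?",
               "options": ["Pneumothorax", "Pleural effusion",
                           "Cardiomegaly", "Normal study"],
               "answer_idx": 1}
    print(make_nota_variant(example))
    print(make_no_answer_variant(example))
```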
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 17