Why Benchmark Accuracy Fails to Measure Clinical Reasoning in Medical Vision Language Models: Toward Clinical Adversarial Validation
Keywords: Clinical adversarial validation, medical vision-language models, causal evaluation, shortcut learning, clinical reasoning
TL;DR: Static benchmarks overestimate medical vision-language reasoning; adversarial, causal validation is required to ensure robust, clinically reliable model behavior.
Abstract: Recent medical vision-language models achieve strong performance on widely used benchmarks, yet it remains unclear whether these scores reflect genuine clinical reasoning or reliance on spurious correlations. This paper argues that static, correlational benchmarking is structurally insufficient for evaluating medical vision-language systems intended for high-stakes clinical use. We frame this work as a position paper and provide a principled critique of existing evaluation practices, drawing on theoretical foundations of shortcut learning and empirical evidence from recent adversarial analyses of frontier models. To address these limitations, we propose Clinical Adversarial Validation, an evaluation paradigm that complements passive test-set assessment with clinician-guided adversarial stress testing designed to probe causal reliance, multimodal grounding, and reasoning stability. The proposed framework emphasizes counterfactual perturbations, transparent reasoning audits, and calibrated abstention as core evaluation criteria. We argue that adopting adversarial, causally informed validation is necessary to bridge the gap between benchmark performance and real-world clinical readiness.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 14