Understanding the limitations of medical reasoning in large language models
Keywords: Medical diagnosis, LLM fragility, LLM robustness, LLM benchmark
TL;DR: Large language models show impressive benchmark performance, yet they prove fragile under clinician-validated perturbations that reflect the complexity of real clinical data, raising concerns about their readiness for real-world healthcare deployment.
Track: Findings
Abstract: Large language models demonstrate impressive performance on standardized healthcare benchmarks, yet their readiness for deployment in real-world clinical environments remains poorly understood. Current medical benchmarks present idealized scenarios that misrepresent the complexity of actual clinical data. We systematically evaluate LLM robustness by introducing clinician-validated perturbations to MedQA that mirror authentic healthcare settings: medically irrelevant information (red herrings), clinical writing styles, and standard medical abbreviations. Our comprehensive evaluation across nine models reveals substantial fragility, with diagnostic accuracy dropping by up to 9.4%. Notably, semantic distractions pose the greatest threat, while some models demonstrate relative resilience to stylistic variations and medical abbreviations. Our paper addresses the gap between benchmark performance and clinical deployment readiness, while providing a systematic framework for assessing AI robustness that can be generalized to other healthcare domains.
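To make the evaluation protocol in the abstract concrete, below is a minimal sketch of how perturbed variants of benchmark items could be generated and scored against a clean baseline. All names here (load of clean items, the ABBREVIATIONS map, apply_red_herring, query_model) are hypothetical placeholders for illustration, not the authors' released code, and the perturbation rules shown are simplified stand-ins for the clinician-validated ones described in the paper.

```python
# Sketch of a perturbation-robustness evaluation loop (assumptions noted above).

# A tiny illustrative map of clinical terms to standard abbreviations;
# the paper's actual abbreviation set is clinician-validated and larger.
ABBREVIATIONS = {"blood pressure": "BP", "shortness of breath": "SOB"}

def apply_abbreviations(question: str) -> str:
    """Replace common clinical terms with standard abbreviations."""
    for term, abbr in ABBREVIATIONS.items():
        question = question.replace(term, abbr)
    return question

def apply_red_herring(question: str, distractor: str) -> str:
    """Append a medically irrelevant but plausible detail (a red herring)."""
    return f"{question} {distractor}"

def accuracy(model, items) -> float:
    """Fraction of (question, answer) pairs the model answers correctly."""
    correct = sum(model(question) == answer for question, answer in items)
    return correct / len(items)

# Usage: compare clean vs. perturbed accuracy for one perturbation type,
# where query_model is a hypothetical callable wrapping an LLM and
# clean_items is a list of (question, gold_answer) pairs from MedQA.
#
# baseline = accuracy(query_model, clean_items)
# perturbed_items = [(apply_abbreviations(q), a) for q, a in clean_items]
# perturbed = accuracy(query_model, perturbed_items)
# print(f"accuracy drop: {baseline - perturbed:.1%}")
```

The same loop would be repeated per model and per perturbation type (red herrings, writing styles, abbreviations), with the accuracy deltas aggregated to yield figures such as the up-to-9.4% drop reported in the abstract.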
General Area: Applications and Practice
Specific Subject Areas: Natural Language Processing, Deployment, Algorithmic Fairness & Bias
Data And Code Availability: Not Applicable
Ethics Board Approval: No
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Submission Number: 28