Right for the Wrong Reasons: A Benchmark for Hallucination and Clinical Safety in AI Health Triage

23 May 2026 (modified: 26 May 2026), VLDB 2026 Workshop BioDMS Submission, CC BY 4.0
Submission Type: Project Talk Proposal
Keywords: clinical AI, hallucination, data management, LLM evaluation, medical triage, patient safety, reproducible benchmark
Abstract: Consumer-facing AI systems now advise millions of patients on the urgency of their health conditions, yet their evaluation has largely focused on accuracy alone. Accuracy is insufficient for high-stakes health decision support: models must also provide sound, contextually grounded reasoning, because their explanations can directly influence whether patients seek timely care or delay treatment. To address this gap, we introduce an empirical benchmark for clinical triage systems that assesses explanation quality alongside decision outcomes. We compare four open-source language models against a ChatGPT Health baseline on 78 physician-labeled clinical vignettes spanning 19 medical domains. Beyond accuracy, we evaluate calibration, clinical reasoning coherence, faithfulness to clinical context, and over- and under-triage rates, all dimensions critical to patient safety. Our results show that ChatGPT Health, the top performer in our evaluation, reaches 84.6% accuracy but exhibits substantial fabrication in 69.2% of its explanations, while Gemma 2 9B achieves 79.5% accuracy. We further observe strong prompt sensitivity: under structured prompting, DeepSeek-R1 7B degrades by 20.5 percentage points, while Mistral 7B improves by 15.4 percentage points. We provide a reproducible, checkpoint-based evaluation pipeline and outline a roadmap for bias stress-testing, hallucination mitigation, and open benchmark release.
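As an illustration of the over- and under-triage rates named in the abstract, the following minimal Python sketch computes them from predicted and physician-assigned urgency labels. It assumes an ordinal urgency scale and the common convention that over-triage means a prediction more urgent than the reference and under-triage means a less urgent one. The label names, the `URGENCY` mapping, and the `triage_rates` helper are hypothetical; the abstract does not specify the benchmark's actual label set or scoring code.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical ordinal urgency scale (higher = more urgent).
# The benchmark's real label set is not given in the abstract.
URGENCY = {"self_care": 0, "routine": 1, "urgent": 2, "emergency": 3}


@dataclass
class TriageRates:
    accuracy: float      # fraction of exact matches with the reference label
    over_triage: float   # fraction predicted MORE urgent than the reference
    under_triage: float  # fraction predicted LESS urgent than the reference


def triage_rates(predicted: Sequence[str], reference: Sequence[str]) -> TriageRates:
    """Compare model urgency labels against physician labels, case by case."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("predicted and reference must be non-empty and equal in length")

    n = len(predicted)
    # Positive difference: model was more urgent; negative: less urgent.
    diffs = [URGENCY[p] - URGENCY[r] for p, r in zip(predicted, reference)]

    return TriageRates(
        accuracy=sum(d == 0 for d in diffs) / n,
        over_triage=sum(d > 0 for d in diffs) / n,
        under_triage=sum(d < 0 for d in diffs) / n,
    )


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the benchmark.
    model_labels = ["emergency", "routine", "self_care", "urgent"]
    physician_labels = ["urgent", "routine", "routine", "urgent"]
    print(triage_rates(model_labels, physician_labels))
```

Reporting the two error directions separately matters because they carry different risks: under-triage can delay needed care, while over-triage mainly adds unnecessary visits, and a single accuracy figure hides that asymmetry.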
Submission Number: 6