The Hallucination Dependence Index: A Cross-Condition Diagnostic for Clinical-LLM Faithfulness

Published: 28 May 2026, Last Modified: 28 May 2026 · ICML 2026 FM4LS Workshop Poster · CC BY 4.0
Keywords: clinical LLM, LLM faithfulness, hallucination evaluation, Hallucination Dependence Index, cross-condition evaluation, foundation models for biomedical summarization, multimodal evidence grounding, TCGA-LUAD, LLM-as-judge, paired bootstrap, reproducible benchmarks
TL;DR: The Hallucination Dependence Index (HDI), a clinical-LLM faithfulness metric, measures the fraction of a model's ungrounded hallucination that grounding suppresses; on a 2×2 factorial of frontier LLMs evaluated on TCGA-LUAD oncology cases, it inverts safety rankings.
Abstract: We introduce the Hallucination Dependence Index (HDI), a paired metric that reports the fraction of a clinical LLM's ungrounded hallucination that grounding actually suppresses, computed under bit-for-bit identical prompts with only the evidence bundle substituted between conditions. Standard clinical-LLM faithfulness evaluations report a single supported-claim rate, which by construction cannot tell a model that reads the bundle apart from one whose pretraining priors already match the cohort; HDI plus an embedding-overlap probe distinguishes calibrated refusal from silent prior-recycling. We instantiate HDI in a cross-condition harness on 119 TCGA-LUAD cases anchored to a two-expert consensus adjudication ($\kappa = 0.64$; independent inter-annotator $\kappa = 0.71$), with a fixed external LLM judge (gpt-4.1-mini, in neither factorial row) that neutralizes self-preference bias. Across gpt-4o-mini, gpt-5.4-mini, gemini-2.5-flash, and the gemini-3-flash *preview* endpoint, HDI spans 0.336–0.984 while grounded support compresses to 81.9–93.2%, inverting safety rankings; gpt-5.4-mini's low HDI reflects calibrated refusal, not prior-recycling (3.2% semantic overlap). Pairwise patient-paired bootstrap separates all six model pairs at Holm-corrected $p \leq 0.009$. These structural findings are invisible to grounded-only metrics.
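The abstract describes HDI only in words, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes HDI is computed as $(H_u - H_g) / H_u$, where $H_u$ and $H_g$ are cohort-mean hallucination rates in the ungrounded and grounded conditions, and that each patient contributes one rate per condition. The function names (`hdi`, `paired_bootstrap_hdi_diff`, `holm`) and the input layout are hypothetical; the sketch also illustrates one way a patient-paired bootstrap and a Holm correction could be combined.

```python
import random


def hdi(ungrounded_rate, grounded_rate):
    """Assumed formula: fraction of ungrounded hallucination that grounding suppresses."""
    if ungrounded_rate <= 0:
        raise ValueError("HDI is undefined when the ungrounded hallucination rate is zero")
    return (ungrounded_rate - grounded_rate) / ungrounded_rate


def mean(values):
    return sum(values) / len(values)


def paired_bootstrap_hdi_diff(a_ung, a_gr, b_ung, b_gr, n_boot=2000, seed=0):
    """Bootstrap replicates of HDI(A) - HDI(B).

    Patients are resampled with replacement, and the same resampled
    indices are used for both models so the comparison stays paired.
    Inputs are per-patient hallucination rates (a hypothetical layout).
    """
    n = len(a_ung)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            h_a = hdi(mean([a_ung[i] for i in idx]), mean([a_gr[i] for i in idx]))
            h_b = hdi(mean([b_ung[i] for i in idx]), mean([b_gr[i] for i in idx]))
        except ValueError:
            continue  # skip resamples where HDI is undefined
        diffs.append(h_a - h_b)
    return diffs


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k) and enforce monotonicity.
        running = max(running, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running
    return adjusted
```

Under this assumed formula, a model whose hallucination rate falls from 0.5 ungrounded to 0.1 grounded has an HDI of 0.8, while a model whose rate barely moves has an HDI near zero even if its grounded rate looks similar.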
Submission Number: 86