Leveraging Evidence-Guided LLMs to Enhance Trustworthy Depression Diagnosis

Published: 19 Aug 2025, Last Modified: 12 Oct 2025 · BHI 2025 · CC BY 4.0
Confirmation: I have read and agree with the IEEE BHI 2025 conference's submission policy on behalf of myself and my co-authors.
Keywords: Large Language Models, Diagnostic Reasoning, Mental Health AI, DSM-5, Knowledge Graphs, Explainable AI, Evidence-Guided Reasoning
Abstract: Large language models (LLMs) show promise in automating clinical diagnosis, yet their opaque decision-making and limited alignment with diagnostic standards hinder trust and clinical adoption. We address this challenge with a two-stage diagnostic framework that enhances transparency, trustworthiness, and reliability. First, we introduce evidence-guided diagnostic reasoning (EGDR), which guides LLMs to generate structured diagnostic hypotheses by interleaving evidence extraction with logical reasoning grounded in DSM-5 criteria. Second, we propose a Diagnosis Confidence Scoring (DCS) module that evaluates the factual accuracy and logical consistency of generated diagnoses through two interpretable metrics: a Knowledge Attribution Score (KAS) and a Logic Consistency Score (LCS). Evaluated on the D4 dataset with pseudo-labels, EGDR outperforms direct in-context prompting (Direct) and Chain-of-Thought (CoT) prompting across five LLMs. For instance, on OpenBioLLM, EGDR improves accuracy from 0.31 (Direct) to 0.76 and DCS from 0.50 to 0.67; on MedLlama, DCS rises from 0.58 (CoT) to 0.77. Overall, EGDR yields gains of up to +0.45 in accuracy and +0.36 in DCS over the baselines, offering a clinically grounded, interpretable foundation for trustworthy AI-assisted diagnosis.
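The abstract describes DCS as a combination of two interpretable metrics, KAS (grounding of extracted evidence in DSM-5 criteria) and LCS (consistency of the reasoning chain). Below is a minimal Python sketch of how such a composite score could be structured; the function names, the equal weighting, the data shapes, and the toy inputs are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a Diagnosis Confidence Scoring (DCS) module:
# KAS = fraction of extracted evidence statements mapped to a DSM-5 criterion,
# LCS = fraction of reasoning steps that cite evidence and are entailed by it.
# All names, weights, and inputs here are assumptions for illustration only.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class ReasoningStep:
    evidence_ids: list[str]      # evidence statements cited by this step
    conclusion_supported: bool   # whether the step's conclusion follows from them

def knowledge_attribution_score(evidence_to_criterion: dict[str, str | None]) -> float:
    """Fraction of extracted evidence grounded in some DSM-5 criterion (assumed definition)."""
    if not evidence_to_criterion:
        return 0.0
    grounded = sum(1 for c in evidence_to_criterion.values() if c is not None)
    return grounded / len(evidence_to_criterion)

def logic_consistency_score(steps: list[ReasoningStep]) -> float:
    """Fraction of reasoning steps that cite evidence and are supported by it (assumed definition)."""
    if not steps:
        return 0.0
    consistent = sum(1 for s in steps if s.evidence_ids and s.conclusion_supported)
    return consistent / len(steps)

def diagnosis_confidence_score(kas: float, lcs: float, alpha: float = 0.5) -> float:
    """Weighted combination of KAS and LCS; the equal weighting is an assumption."""
    return alpha * kas + (1 - alpha) * lcs

# Toy usage: two of three evidence items map to DSM-5 criteria A1/A4,
# and two of three reasoning steps are evidence-backed.
kas = knowledge_attribution_score({"e1": "A1", "e2": "A4", "e3": None})
lcs = logic_consistency_score([
    ReasoningStep(["e1"], True),
    ReasoningStep(["e2"], True),
    ReasoningStep([], False),
])
print(round(diagnosis_confidence_score(kas, lcs), 2))
```

The split into two independently computed sub-scores is what makes the confidence estimate interpretable: a low KAS flags diagnoses built on evidence that cannot be attributed to DSM-5 criteria, while a low LCS flags reasoning chains whose conclusions do not follow from the cited evidence.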
Track: 4. Clinical Informatics
Registration Id: 6QNYQKQMFZ3
Submission Number: 370