Keywords: discharge summaries, clinical documentation quality, automated auditing, large language models, MIMIC-IV, care transitions, benchmark
TL;DR: We introduce a preliminary benchmark of 46 audit questions, operationalized from the DISCHARGED framework and applied to 50 MIMIC-IV discharge summaries, and use it to evaluate 11 LLMs. The best model achieves moderate agreement with clinicians (κ=0.496), and all models struggle to detect ambiguous documentation.
Abstract: Incomplete or inconsistent discharge documentation drives care fragmentation and avoidable readmissions. Although discharge summaries are critical to patient safety, auditing them relies on manual review, which does not scale. We propose an automated framework for auditing discharge summaries using large language models (LLMs). Our approach operationalizes the DISCHARGED framework into a checklist of 46 questions. Using 50 discharge summaries from the MIMIC-IV database, annotated with clinician ground-truth labels, we benchmark 11 LLMs. Across models, mean model-assessed documentation completeness ranges from 54.9% to 74.2%, and the best-performing model achieves a Cohen’s κ of 0.496 against clinician labels, indicating moderate agreement. All models struggle to identify ambiguous documentation (labeled Unclear), highlighting a key gap in current automated auditing. This work provides a clinician-validated benchmark and zero-shot baselines to support systematic quality improvement in clinical documentation.
Submission Number: 111