Keywords: Diagnosis Summarization, Clinical Notes, Benchmark Dataset and Systematic Evaluation, Language Models
Abstract: Summarizing diagnoses from clinical notes (long, unstructured text) can reduce the diagnostic documentation burden and support handoffs, referrals, and follow-ups.
Towards this goal, our work introduces a reproducible benchmark and systematic evaluation for diagnosis summarization from clinical notes. We frame the task as sequence-to-sequence generation under two input settings, with each note's discharge diagnoses serving as the target.
To reflect real-world clinical variability, we define two input formulations: Summ-CPP, which uses the chief complaint, history of present illness, and past medical history; and Summ-Full-Note, which uses the entire note, excluding the diagnoses.
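As an illustration, below is a minimal sketch of how the two formulations could be assembled from a pre-sectioned note; the section names and `note` dictionary are assumptions for illustration, not the paper's actual preprocessing code.

```python
# Hypothetical sketch: building Summ-CPP and Summ-Full-Note inputs from a
# pre-sectioned clinical note. Section names are assumed, not the paper's
# actual schema.

CPP_SECTIONS = ["chief_complaint", "history_of_present_illness", "past_medical_history"]
EXCLUDED_SECTIONS = {"discharge_diagnosis"}  # diagnoses are the target, never the input

def build_inputs(note: dict[str, str]) -> tuple[str, str, str]:
    """Return (summ_cpp_input, summ_full_note_input, target)."""
    summ_cpp = "\n".join(note[s] for s in CPP_SECTIONS if s in note)
    summ_full = "\n".join(text for name, text in note.items()
                          if name not in EXCLUDED_SECTIONS)
    target = note["discharge_diagnosis"]
    return summ_cpp, summ_full, target
```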
Our study evaluates both general-purpose and domain-adapted language models under fine-tuned and zero-shot prompting configurations.
Pretrained language models (BART, T5, and SciFive) are fine-tuned on the curated training notes, while large language models (e.g., GPT-4o-mini) are evaluated in a zero-shot prompting setting.
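A minimal sketch of the zero-shot setting, using the OpenAI Python client; the prompt wording is an assumption, not the paper's actual prompt.

```python
# Hypothetical sketch of zero-shot prompting GPT-4o-mini for diagnosis
# summarization; the system prompt below is an assumed placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def zero_shot_diagnoses(note_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic decoding for reproducible evaluation
        messages=[
            {"role": "system", "content": (
                "You are a clinical assistant. List the discharge diagnoses "
                "for the following note, one per line."
            )},
            {"role": "user", "content": note_text},
        ],
    )
    return response.choices[0].message.content
```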
We will release the curated dataset on PhysioNet under the MIMIC-IV-Note data use agreement. The dataset consists of 10k samples, balanced across ten service types (medicine, surgery, orthopaedics, neurology, cardiothoracic, neurosurgery, obstetrics, psychiatry, urology, and plastic surgery) and divided into three splits: train (8k), validation (1k), and test (1k).
Performance is assessed using ROUGE and BERTScore to quantify both n-gram overlap and semantic similarity, with results summarized in Table 1.
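A hedged sketch of the metric computation using the `rouge-score` and `bert-score` packages; this shows one plausible setup, not necessarily the paper's exact configuration (e.g., its BERTScore model choice).

```python
# Sketch of ROUGE and BERTScore evaluation over (prediction, reference) pairs.
from rouge_score import rouge_scorer
from bert_score import score as bertscore

predictions = ["acute appendicitis"]                            # model outputs
references = ["acute appendicitis, status post appendectomy"]  # discharge diagnoses

# ROUGE: n-gram overlap, reported here as F-measure.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = [scorer.score(ref, pred) for ref, pred in zip(references, predictions)]
print({name: s.fmeasure for name, s in rouge[0].items()})

# BERTScore: semantic similarity via contextual embeddings.
P, R, F1 = bertscore(predictions, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```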
Our results show that domain-adapted models are crucial for achieving clinically reliable performance.
Submission Number: 404