Evaluating LLMs for Diagnosis Summarization

Published: 01 Jan 2024 · Last Modified: 19 Feb 2025 · EMBC 2024 · License: CC BY-SA 4.0
Abstract: During a patient’s hospitalization, extensive information is documented in clinical notes. Efficient summarization of this information is vital for keeping healthcare professionals abreast of the patient’s status. This paper proposes a methodology to assess the efficacy of six large language models (LLMs) in automating diagnosis summarization, particularly in discharge summaries. Our approach defines an LLM-based automatic metric that correlates strongly with human assessments. We evaluate the performance of the six models using the F1-Score and compare the results against those of healthcare specialists. The experiments reveal that there is room for improvement in the medical knowledge and diagnostic capabilities of LLMs. The source code and data for these experiments are available on the project’s GitHub page.
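To illustrate the kind of evaluation the abstract describes, the sketch below computes a set-level F1-Score between diagnoses extracted by a model and reference diagnoses from a discharge summary. The function name, the exact-match criterion, and the sample diagnoses are illustrative assumptions, not the paper's actual metric or data.

```python
# Hypothetical sketch: set-level F1 between predicted and reference
# diagnoses. Exact string matching after lowercasing is an assumption;
# the paper's LLM-based metric is more sophisticated.

def diagnosis_f1(predicted: set[str], reference: set[str]) -> float:
    """F1-Score over normalized diagnosis strings."""
    predicted = {d.lower().strip() for d in predicted}
    reference = {d.lower().strip() for d in reference}
    if not predicted and not reference:
        return 1.0  # both empty: vacuously perfect agreement
    tp = len(predicted & reference)  # true positives: shared diagnoses
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# Illustrative example (fabricated diagnoses):
pred = {"Type 2 diabetes mellitus", "Hypertension", "Anemia"}
ref = {"Type 2 diabetes mellitus", "Hypertension", "Acute kidney injury"}
print(round(diagnosis_f1(pred, ref), 3))  # → 0.667
```

Here precision and recall are both 2/3 (two of three predictions match, two of three references are recovered), so the F1-Score is also 2/3.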