Testing and Evaluation of Generative Large Language Models in Electronic Health Record Applications: A Systematic Review

Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W. Bates, Li Zhou

Published: 12 Aug 2024 · Last Modified: 24 Nov 2025 · Crossref · CC BY-SA 4.0
Abstract

Background: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review synthesizes current strategies, challenges, and future directions for adapting and evaluating generative LLMs in EHR analyses and applications.

Methods: We followed the PRISMA guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified the clinical specialties and tasks for each included article and summarized the evaluation methods.

Results: Of the 18,735 articles retrieved, 196 met our criteria. Most studies focused on Radiology (26.0%), Oncology (10.7%), and Emergency Medicine (6.6%). Among clinical tasks, clinical decision support was the most common (62.2% of studies), while summarization (5.6%) and patient communication (5.1%) were the least common. GPT-4 and ChatGPT were the most frequently used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. We identified 22 unique non-NLP metrics and 35 unique NLP metrics. Although NLP metrics offer better scalability, none of the metrics were found to correlate strongly with gold-standard human evaluations.

Conclusion: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.
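To make the metric-correlation finding concrete, below is a minimal sketch (not from the reviewed studies; all example texts and ratings are hypothetical) of how one might check whether an automated NLP metric such as ROUGE-L tracks human judgment, assuming the `rouge-score` and `scipy` packages are installed:

```python
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Hypothetical reference texts, LLM-generated outputs, and 1-5 human ratings.
references = [
    "Patient presents with chest pain and shortness of breath.",
    "MRI shows no acute intracranial abnormality.",
    "Discharge with oral antibiotics; follow up in two weeks.",
]
generated = [
    "Chest pain and dyspnea reported by the patient.",
    "No acute abnormality on brain MRI.",
    "Patient discharged on antibiotics with a 2-week follow-up.",
]
human_scores = [4, 5, 4]

# Score each reference/output pair with ROUGE-L, one commonly reported NLP metric.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_scores = [
    scorer.score(ref, gen)["rougeL"].fmeasure
    for ref, gen in zip(references, generated)
]

# Spearman rank correlation between the automated metric and human ratings;
# the review reports that no metric showed a strong correlation in practice.
rho, p = spearmanr(rouge_scores, human_scores)
print(f"ROUGE-L vs. human ratings: Spearman rho = {rho:.2f} (p = {p:.2f})")
```

A real validation study would use far more samples and multiple annotators; this sketch only illustrates the comparison the review describes between scalable NLP metrics and gold-standard human evaluation.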