MedJourney: Benchmark and Evaluation of Large Language Models over Patient Clinical Journey

Xian Wu; Yutian Zhao; Yunyan Zhang; Jiageng Wu; Zhihong Zhu; Yingying Zhang; Yi Ouyang; Ziheng Zhang; Huimin WANG; Zhenxi Lin; Jie Yang; Shuang Zhao; Yefeng Zheng

Published: 26 Sept 2024, Last Modified: 13 Nov 2024

Venue: NeurIPS 2024 Track Datasets and Benchmarks Poster

License: CC BY 4.0

Keywords: LLM, Clinical Journey, Benchmark, Evaluation, Medical

TL;DR: A benchmark that evaluates the performance of LLMs across the entire patient clinical journey.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in language understanding and generation, leading to their widespread adoption across various fields. Among these, the medical field is particularly well suited to LLM applications, as many medical tasks can be enhanced by LLMs. Although benchmarks exist for evaluating LLMs on medical question answering and exams, there remains a notable gap in assessing how well LLMs support patients throughout their entire hospital visit journey in real-world clinical practice. In this paper, we address this gap by dividing a typical patient's clinical journey into four stages: planning, access, delivery, and ongoing care. For each stage, we introduce multiple tasks and corresponding datasets, resulting in a comprehensive benchmark comprising 12 datasets, of which five are newly introduced and seven are constructed from existing datasets. The proposed benchmark facilitates a thorough evaluation of LLMs' effectiveness across the entire patient journey, providing insights into their practical application in clinical settings. Additionally, we evaluate three categories of LLMs against this benchmark: 1) proprietary LLM services such as GPT-4; 2) public LLMs such as QWen; and 3) specialized medical LLMs such as HuatuoGPT2. Through this extensive evaluation, we aim to provide a better understanding of LLMs' performance in the medical domain, ultimately contributing to their more effective deployment in healthcare settings.

Flagged For Ethics Review: true

Submission Number: 1582
