MSDiagnosis: A Benchmark for Evaluating Large Language Models in Multi-Step Clinical Diagnosis

ACL ARR 2024 December Submission1693 Authors

16 Dec 2024 (modified: 05 Feb 2025)ACL ARR 2024 December SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Clinical diagnosis is critical in medical practice, typically requiring a continuous and evolving process that includes primary diagnosis, differential diagnosis, and final diagnosis. However, most existing clinical diagnostic tasks are single-step processes, which does not align with the complex multi-step diagnostic procedures found in real-world clinical settings. In this paper, we propose a Chinese clinical diagnostic benchmark, called MSDiagnosis. This benchmark consists of 2,225 cases from 12 departments, covering tasks such as primary diagnosis, differential diagnosis, and final diagnosis. Additionally, we propose a novel and effective framework. This framework combines forward inference, backward inference, reflection, and refinement, enabling the large language model to self-evaluate and adjust its diagnostic results. To this end, we test open-source models, closed-source models, and our proposed framework.The experimental results demonstrate the effectiveness of the proposed method. We also provide a comprehensive experimental analysis and suggest future research directions for this task. The dataset and codes are available at the anonymous URL~\url{https://anonymous.4open.science/r/MDQA-6EC0/}.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: healthcare applications, clinical NLP, benchmarking, biomedical QA
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Chinese
Submission Number: 1693
Loading