MSDiagnosis: A Chinese Benchmark for Evaluating Large Language Models in Multi-Step Clinical Diagnosis

ACL ARR 2025 February Submission 2021 Authors

14 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract:

Clinical diagnosis is critical in medical practice and typically requires a continuous, evolving process that includes primary diagnosis, differential diagnosis, and final diagnosis. However, most existing clinical diagnostic tasks are single-step processes, which do not align with the complex multi-step diagnostic procedures found in real-world clinical settings. In this paper, we propose a Chinese clinical diagnostic benchmark called MSDiagnosis. The benchmark consists of 2,225 cases from 12 departments, covering tasks such as primary diagnosis, differential diagnosis, and final diagnosis. We also propose a novel framework that combines forward inference, backward inference, reflection, and refinement, enabling a large language model to self-evaluate and adjust its diagnostic results. We evaluate medical language models, general language models, and our proposed framework on the benchmark. The experimental results demonstrate the effectiveness of the proposed method. We also provide a comprehensive experimental analysis and suggest future research directions for this task. The dataset and code are available at the anonymous URL https://anonymous.4open.science/r/MDQA-6EC0.
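
The abstract names four stages (forward inference, backward inference, reflection, refinement) but does not specify how they are wired together. The following is a minimal sketch of one plausible arrangement, not the authors' implementation. Everything in it is an assumption made for illustration: the `LLM` callable type, the `diagnose` function, the prompt wording, the `max_rounds` parameter, the reading of "forward" as record-to-diagnosis and "backward" as diagnosis-to-evidence, and the `stub_llm` stand-in used in place of a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a text reply.
LLM = Callable[[str], str]


@dataclass
class DiagnosisTrace:
    """Intermediate outputs of one diagnostic pass (illustrative only)."""
    forward: str = ""
    backward: str = ""
    critiques: List[str] = field(default_factory=list)
    final: str = ""


def diagnose(record: str, llm: LLM, max_rounds: int = 2) -> DiagnosisTrace:
    trace = DiagnosisTrace()
    # Forward inference (assumed meaning): record -> candidate diagnosis.
    trace.forward = llm(f"FORWARD\nRecord: {record}\nPropose a diagnosis.")
    # Backward inference (assumed meaning): candidate -> supporting evidence.
    trace.backward = llm(
        f"BACKWARD\nRecord: {record}\nDiagnosis: {trace.forward}\n"
        "List the evidence in the record that supports it."
    )
    current = trace.forward
    for _ in range(max_rounds):
        # Reflection: ask the model to check the diagnosis against the evidence.
        critique = llm(
            f"REFLECT\nRecord: {record}\nDiagnosis: {current}\n"
            f"Evidence: {trace.backward}\nReply OK if consistent, else explain."
        )
        trace.critiques.append(critique)
        if critique.strip().upper() == "OK":
            break
        # Refinement: revise the diagnosis using the critique.
        current = llm(
            f"REFINE\nRecord: {record}\nDiagnosis: {current}\n"
            f"Critique: {critique}\nGive a revised diagnosis."
        )
    trace.final = current
    return trace


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, keyed on the stage tag."""
    stage = prompt.split("\n", 1)[0]
    if stage == "FORWARD":
        return "common cold"
    if stage == "BACKWARD":
        return "fever, cough"
    if stage == "REFLECT":
        if "Diagnosis: influenza" in prompt:
            return "OK"
        return "high fever is atypical; reconsider"
    return "influenza"  # REFINE stage


trace = diagnose("3 days of high fever and cough", stub_llm)
print(trace.final)
```

With the stub, the first reflection rejects the initial candidate, the refinement step revises it, and the second reflection accepts the revision. In practice `stub_llm` would be replaced by a call to an actual model, and the stopping rule would need something more robust than matching the literal string "OK".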

Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: healthcare applications, clinical NLP, benchmarking, biomedical QA
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Chinese
Submission Number: 2021