Abstract: The rapid development of medical large language models (LLMs) enables users to complete preliminary medical consultations (self-diagnosis) in their daily lives. Recent evaluations of medical LLMs focus mainly on their ability to complete medical tasks, pass medical examinations, or obtain favorable GPT-4 ratings. Such evaluations still face challenges in providing directions for improving medical LLMs, including misalignment with practical usage, lack of depth in exploration, and over-reliance on GPT-4. To address these issues, we construct a fact-checking-style Self-Diagnostic Atomic Knowledge (SDAK) benchmark. Through atomic knowledge items that are close to real usage scenarios, it evaluates how well medical LLMs memorize medical knowledge more accurately, reliably, and fundamentally. Furthermore, we design three cascading metrics for comprehensive evaluation without relying on GPT-4. The experimental results show that Chinese medical LLMs still have considerable room for improvement in self-diagnostic atomic knowledge. Error analysis reveals that sycophancy is the primary cause of errors in both general and medical LLMs. We further explore the types of data commonly adopted for fine-tuning medical LLMs and find that distilled data enhances medical knowledge retention more effectively than real-world doctor-patient conversations.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: Chinese