Abstract: Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error.
Current medical benchmarks fall into three main categories: medical-exam-based, comprehensive medical, and specialized assessments.
However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning).
To address these issues, we present **LLMEval-Med**, a new benchmark covering five core medical areas and comprising 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
We also design an automated evaluation pipeline that incorporates expert-developed checklists into our LLM-as-Judge framework. We validate machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability.
We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains.
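Below is a minimal sketch, not taken from the paper, of how a checklist-driven LLM-as-Judge score and a human-machine agreement check might be wired together. The `judge_llm` callable, the checklist format, the per-item 0/1 scale, and the correlation metrics are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' implementation): score an answer
# against an expert checklist with a judge LLM, then measure human-machine agreement.
from scipy.stats import pearsonr, spearmanr


def judge_with_checklist(judge_llm, question, model_answer, checklist):
    """Ask a judge LLM to mark each checklist item as satisfied (1) or not (0)."""
    prompt = (
        "You are a medical expert grader.\n"
        f"Question: {question}\n"
        f"Candidate answer: {model_answer}\n"
        "For each checklist item below, output 1 if the answer satisfies it, "
        "otherwise 0, one number per line:\n"
        + "\n".join(f"- {item}" for item in checklist)
    )
    reply = judge_llm(prompt)  # hypothetical callable returning the judge's text
    marks = [int(tok) for tok in reply.split() if tok in {"0", "1"}]
    # Score = fraction of checklist items the judge marked as satisfied.
    return sum(marks) / max(len(checklist), 1)


def human_machine_agreement(machine_scores, human_scores):
    """Correlate machine and human scores; low agreement would trigger the kind of
    checklist/prompt refinement loop the abstract describes."""
    r, _ = pearsonr(machine_scores, human_scores)
    rho, _ = spearmanr(machine_scores, human_scores)
    return {"pearson": r, "spearman": rho}
```

In a pipeline like the one described, the agreement statistics would be reviewed by experts, and checklists or judge prompts would be revised until machine scores track human judgments acceptably.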
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Evaluation, Medical, LLMs
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 3842