Abstract: In recent years, large language models (LLMs) have demonstrated remarkable capabilities in the medical domain. However, existing medical benchmarks suffer from performance saturation and are predominantly derived from medical exam questions, which fail to adequately capture the complexity of real-world clinical scenarios.
To bridge this gap, we introduce ClinBench, a challenging benchmark based on authentic clinical cases sourced from authoritative medical journals. Each question retains the complete patient information and clinical test results from the original case, effectively simulating real-world clinical practice. Additionally, we implement a rigorous human review process involving medical experts to ensure the quality and reliability of the benchmark.
ClinBench supports both textual and multimodal evaluation formats, covering 12 medical specialties with over 2,000 questions, providing a comprehensive assessment of LLMs’ medical capabilities. We evaluate over 20 open-source and proprietary LLMs and benchmark them against human medical experts. Our findings reveal that human experts still retain an advantage within their specialized fields, while LLMs achieve superior overall performance across a broader range of medical specialties.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Medical Benchmark, Large Language Models, Real Clinical Scenarios
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 4588