PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

ACL ARR 2024 June Submission1901 Authors

15 Jun 2024 (modified: 04 Jul 2024) · CC BY 4.0
Abstract: The emergence of Large Language Models (LLMs) in the medical domain has created a pressing need for standard datasets to evaluate their performance. Although several benchmark datasets for medical problems exist, they either cover common knowledge across different departments or are specific to a department other than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generative capability of LLMs. Consequently, they cannot comprehensively assess the ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,565 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on five question types to thoroughly assess an LLM's proficiency in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench through extensive experiments on 21 open-source and commercial LLMs. Through an in-depth analysis of the experimental results, we offer insights into the ability of LLMs to handle pediatric questions in the Chinese context and highlight their limitations to guide further improvement. Our code and data are published anonymously at https://anonymous.4open.science/r/PediaBench-E8E2.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Benchmark, Large Language Models, Pediatrics
Contribution Types: Data resources
Languages Studied: English, Chinese
Submission Number: 1901