CHBench: A Chinese Dataset for Evaluating Health in Large Language Models

ACL ARR 2025 May Submission1196 Authors

16 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: With the rapid development of large language models (LLMs), assessing their performance on health-related inquiries has become increasingly essential. Because misinformation can have serious consequences for individuals seeking medical advice and support, deploying these models in real-world contexts demands a rigorous focus on safety and trustworthiness. In this work, we introduce CHBench, the first comprehensive safety-oriented Chinese health benchmark, designed to evaluate how well LLMs understand and address physical and mental health issues from a safety perspective across diverse scenarios. Rather than focusing on medical or diagnostic tasks, CHBench highlights safety-related concerns such as risk awareness and appropriate behavioral guidance in everyday health contexts. CHBench comprises 6,493 entries on mental health and 2,999 entries on physical health, spanning a wide range of topics. Our extensive evaluations of four popular Chinese LLMs reveal significant gaps in their capacity to deliver safe and accurate health information, underscoring the urgent need for further advancements in this critical domain.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets
Contribution Types: Data resources
Languages Studied: Chinese
Keywords: safety, benchmark
Submission Number: 1196