From Empathy to Action: Benchmarking LLMs in Mental Health with MentalBench-10 and a Novel Cognitive-Affective Evaluation Approach

ACL ARR 2025 May Submission 4316 Authors

19 May 2025 (modified: 03 Jul 2025)
License: CC BY 4.0
Abstract: Evaluating Large Language Models (LLMs) for mental health support poses unique challenges due to the emotionally sensitive and cognitively complex nature of therapeutic conversations. Widely used automatic metrics (e.g., ROUGE) fail to capture therapeutic attributes such as empathy and safety, and often misrepresent the true quality of LLM-generated responses. Human evaluation, while more accurate, remains costly, time-consuming, and difficult to scale, and real-world benchmarks for mental healthcare are scarce. To this end, we introduce MentalBench-10, a large real-world benchmark for mental health dialogue evaluation comprising 10,000 conversations sourced from real therapeutic exchanges, each paired with one human response and nine LLM-generated responses drawn from available datasets. To evaluate these responses, we propose a clinically grounded dual-axis evaluation based on a Cognitive Support Score (CSS) and an Affective Resonance Score (ARS), supported by both human experts and multiple LLM-based judges. Our findings reveal that LLMs match or exceed human responses, especially on cognitive dimensions such as relevance and safety, whereas affective traits such as empathy remain challenging, particularly for open-source models. We further quantify judge reliability using an Alignment Factor that measures agreement between human and LLM-based ratings. This work not only highlights the growing competency of LLMs in mental health tasks but also provides a robust, scalable framework for future evaluations. We will release MentalBench-10, along with evaluation results from human annotators and LLM judges.
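The abstract does not specify how the Alignment Factor is computed. As a purely illustrative sketch of the general idea (agreement between human and LLM-judge ratings over the same responses), the snippet below uses Kendall's tau as a stand-in agreement measure; the scores shown are hypothetical and the actual metric in the paper may differ.

```python
# Illustrative sketch only: rank agreement between human and LLM-judge
# ratings, using Kendall's tau as a stand-in for the paper's Alignment
# Factor (which is not defined in the abstract). Scores are hypothetical.
from scipy.stats import kendalltau

human_scores = [4, 5, 3, 4, 2, 5]   # hypothetical human ratings per response
llm_scores   = [4, 4, 3, 5, 2, 5]   # hypothetical LLM-judge ratings

tau, p_value = kendalltau(human_scores, llm_scores)
print(f"agreement (Kendall tau): {tau:.3f}, p = {p_value:.3f}")
```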
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, NLP Applications, Language Modeling, Human-Centered NLP, Generation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 4316