Can LLMs Move Beyond Short Exchanges to Realistic Therapy Conversations?

ICLR 2026 Conference Submission 12306 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026, Venue: ICLR 2026, License: CC BY 4.0
Keywords: Real-world Counseling, CBT Therapy, Mental Health
TL;DR: We introduce CareBench-CBT, the largest clinically validated benchmark for CBT-based counseling.
Abstract: Recent incidents have revealed that large language models (LLMs) deployed in mental health contexts can generate unsafe guidance, with reports of chatbots encouraging self-harm. Such risks highlight the urgent need for rigorous, clinically valid evaluation before integration into care. However, existing benchmarks remain inadequate: 1) they rely on synthetic or weakly validated data, undermining clinical reliability; 2) they reduce counseling to isolated QA or single-turn tasks, overlooking the extended, adaptive nature of real interactions; and 3) they rarely capture the formal therapeutic structure of sessions, such as rapport building, guided exploration, intervention, and closure. These gaps risk overestimating LLM competence and obscuring safety-critical failures. To address this, we present CareBench-CBT, the largest clinically validated benchmark for CBT-based counseling. It unifies three components: 1) thousands of expert-curated and validated items that ensure data reliability; 2) realistic multi-turn dialogues that capture long-form therapeutic interaction; and 3) alignment of all sessions with CBT’s formal structure, enabling process-level evaluation of empathy, therapeutic alignment, and intervention quality. All data are anonymized, double-reviewed by 21 licensed professionals, and validated with reliability and competence metrics. Evaluating 18 state-of-the-art LLMs reveals consistent gaps: high scores on public QA degrade under expert rephrasing, vignette reasoning remains difficult, and dialogue competence falls well below that of human counselors. CareBench-CBT provides a rigorous foundation for advancing the safe and responsible integration of LLMs into mental health care. All code and data are released in the Supplementary Materials.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 12306