Can LLMs Move Beyond Short Exchanges to Realistic Therapy Conversations?

Published: 26 Jan 2026, Last Modified: 02 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Real-world Counseling, CBT Therapy, Mental Health
TL;DR: We introduce CareBench-CBT, the largest clinically validated benchmark for CBT-based counseling.
Abstract: Recent incidents have revealed that large language models (LLMs) deployed in mental health contexts can generate unsafe guidance, including reports of chatbots encouraging self-harm. Such risks highlight the urgent need for rigorous, clinically valid evaluation before integration into care. However, existing benchmarks remain inadequate: 1) they rely on synthetic or weakly validated data, undermining clinical reliability; 2) they reduce counseling to isolated QA or single-turn tasks, overlooking the extended, adaptive nature of real interactions; and 3) they rarely capture the formal therapeutic structure of sessions. These gaps risk overestimating LLM competence and obscuring safety-critical failures. To address this, we present \textbf{CareBench-CBT}, the largest clinically validated benchmark for CBT-based counseling, unifying thousands of expert-curated items, realistic multi-turn dialogues, and formal CBT structural alignment. Evaluating 18 state-of-the-art LLMs reveals consistent gaps: high scores on public QA degrade under expert rephrasing, vignette reasoning remains difficult, and dialogue competence falls well below human counselors. Recognizing that long-horizon context management limits multi-turn performance, we further propose Hierarchical Therapy Memory (HTM), a training-free inference framework that structures dialogue history into global states and episodic summaries. HTM consistently improves session-level therapeutic coherence while reducing computational latency. Together, CareBench-CBT and HTM provide a rigorous foundation for advancing the safe and responsible integration of LLMs into mental health care. All code and data are released in the Supplementary Materials.
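The abstract describes Hierarchical Therapy Memory (HTM) only at a high level: dialogue history is restructured into a global state plus episodic summaries rather than a flat transcript. As a rough illustration of that idea, here is a minimal sketch; all names (`HTMemory`, `add_turn`, `build_prompt`, `window`) and the truncation-based summarizer are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of an HTM-style memory: a global state, episodic
# summaries of older turns, and a verbatim window of recent turns.
from dataclasses import dataclass, field

@dataclass
class HTMemory:
    window: int = 4                                       # recent turns kept verbatim
    global_state: dict = field(default_factory=dict)      # e.g. goals, risk flags
    episodic_summaries: list = field(default_factory=list)
    recent_turns: list = field(default_factory=list)

    def add_turn(self, speaker: str, text: str) -> None:
        self.recent_turns.append((speaker, text))
        # When the recent window overflows, compress the oldest turn into an
        # episodic summary (a real system would use an LLM summarizer here).
        while len(self.recent_turns) > self.window:
            spk, old = self.recent_turns.pop(0)
            self.episodic_summaries.append(f"{spk}: {old[:60]}")

    def build_prompt(self) -> str:
        # Condensed context: global state, then summaries, then recent turns.
        parts = [f"[state] {self.global_state}"]
        parts += [f"[summary] {s}" for s in self.episodic_summaries]
        parts += [f"{spk}: {txt}" for spk, txt in self.recent_turns]
        return "\n".join(parts)

mem = HTMemory(window=2)
mem.global_state["goal"] = "reduce anxiety"
for i in range(4):
    mem.add_turn("client", f"turn {i}")
prompt = mem.build_prompt()
```

Because the prompt grows with the number of summaries rather than the full transcript length, a structure like this could plausibly explain the latency reduction the abstract reports, though the actual HTM design is only specified in the paper itself.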
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 12306