Can LLMs Move Beyond Short Exchanges to Realistic Therapy Conversations?

Published: 26 Jan 2026, Last Modified: 02 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Real-world Counseling, CBT Therapy, Mental Health
TL;DR: We introduce CareBench-CBT, the largest clinically validated benchmark for CBT-based counseling.
Abstract: Recent incidents have revealed that large language models (LLMs) deployed in mental health contexts can generate unsafe guidance, including reports of chatbots encouraging self-harm. Such risks highlight the urgent need for rigorous, clinically valid evaluation before integration into care. However, existing benchmarks remain inadequate: 1) they rely on synthetic or weakly validated data, undermining clinical reliability; 2) they reduce counseling to isolated QA or single-turn tasks, overlooking the extended, adaptive nature of real interactions; and 3) they rarely capture the formal therapeutic structure of sessions. These gaps risk overestimating LLM competence and obscuring safety-critical failures. To address this, we present \textbf{CareBench-CBT}, the largest clinically validated benchmark for CBT-based counseling, unifying thousands of expert-curated items, realistic multi-turn dialogues, and formal CBT structural alignment. Evaluating 18 state-of-the-art LLMs reveals consistent gaps: high scores on public QA degrade under expert rephrasing, vignette reasoning remains difficult, and dialogue competence falls well below human counselors. Recognizing that long-horizon context management limits multi-turn performance, we further propose Hierarchical Therapy Memory (HTM), a training-free inference framework that structures dialogue history into global states and episodic summaries. HTM consistently improves session-level therapeutic coherence while reducing computational latency. Together, CareBench-CBT and HTM provide a rigorous foundation for advancing the safe and responsible integration of LLMs into mental health care. All code and data are released in the Supplementary Materials.
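The abstract describes Hierarchical Therapy Memory (HTM) only at a high level: dialogue history is restructured into a global state plus episodic summaries rather than a flat transcript. As a rough illustration of that idea, here is a minimal sketch; all names (`HTMemory`, `add_turn`, `build_prompt`, `window`) and the truncation-based summarizer are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of an HTM-style memory: a global state, episodic
# summaries of older turns, and a verbatim window of recent turns.
from dataclasses import dataclass, field

@dataclass
class HTMemory:
    window: int = 4                                       # recent turns kept verbatim
    global_state: dict = field(default_factory=dict)      # e.g. goals, risk flags
    episodic_summaries: list = field(default_factory=list)
    recent_turns: list = field(default_factory=list)

    def add_turn(self, speaker: str, text: str) -> None:
        self.recent_turns.append((speaker, text))
        # When the recent window overflows, compress the oldest turn into an
        # episodic summary (a real system would use an LLM summarizer here).
        while len(self.recent_turns) > self.window:
            spk, old = self.recent_turns.pop(0)
            self.episodic_summaries.append(f"{spk}: {old[:60]}")

    def build_prompt(self) -> str:
        # Condensed context: global state, then summaries, then recent turns.
        parts = [f"[state] {self.global_state}"]
        parts += [f"[summary] {s}" for s in self.episodic_summaries]
        parts += [f"{spk}: {txt}" for spk, txt in self.recent_turns]
        return "\n".join(parts)

mem = HTMemory(window=2)
mem.global_state["goal"] = "reduce anxiety"
for i in range(4):
    mem.add_turn("client", f"turn {i}")
prompt = mem.build_prompt()
```

Because the prompt grows with the number of summaries rather than the full transcript length, a structure like this could plausibly explain the latency reduction the abstract reports, though the actual HTM design is only specified in the paper itself.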
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 12306