PSI-Bench: Towards Interpretable and Clinically Grounded Evaluation of Depressive Patient Simulators

Published: 28 Apr 2026, Last Modified: 28 Apr 2026, Venue: MSLD 2026 Poster, License: CC BY 4.0
Keywords: patient simulator, mental health, benchmark
Abstract: LLM-based patient simulators have emerged as promising tools for training novice counsellors. However, existing automatic evaluation pipelines assess simulated responses using under-specified Likert-scale prompts on isolated profiles, so they fail to capture psychologically meaningful nuances and cannot assess whether simulators reflect the distributional diversity of real patient populations. To address these limitations, we introduce PSI-Bench, an automatic evaluation framework that systematically compares LLM-generated depressive patient conversations against real patient dialogues at the turn, dialogue, and population levels. Drawing on established findings from psycholinguistics and psychology, we design interpretable, clinically grounded metrics that expose where and why simulators succeed or fail. We benchmark two state-of-the-art simulator frameworks instantiated with different backend LLMs, demonstrate their divergence from real patient behavior, and show that our benchmark correlates with expert psychologist judgments. Our work highlights key limitations of current depressive patient simulators and provides a fast, interpretable, and extensible benchmark to guide future simulator design and evaluation.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 99