Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
Keywords: large language models, safety evaluations, mental health, psychosis, LLM-as-a-Judge, LLM-as-a-Jury, chatbots, evaluation metrics, automated assessment
Abstract: General-purpose Large Language Models (LLMs) are becoming widely adopted by people for mental health support. Yet emerging evidence suggests there are significant risks associated with high-frequency use, particularly for individuals suffering from psychosis, as LLMs may reinforce delusions and hallucinations. Existing evaluations of LLMs in mental health contexts are limited by a lack of clinical validation and scalability of assessment. To address these issues, this research focuses on psychosis as a critical condition for LLM safety evaluation by (1) developing and validating seven clinician-informed safety criteria, (2) constructing a human-consensus dataset, and (3) testing automated evaluation via LLM-as-a-Judge and LLM-as-a-Jury. Results indicate that LLM-as-a-Judge aligns closely with the human consensus (Cohen's $\kappa_{\text{human}\times\text{gemini}} = 0.76$) and that it outperforms LLM-as-a-Jury (Cohen's $\kappa_{\text{human}\times\text{jury}} = 0.71$). These findings have promising implications for clinically grounded, scalable methods in LLM safety evaluations for mental health contexts.
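The reported agreement figures are Cohen's kappa computed between the human-consensus labels and each automated rater. A minimal sketch of that comparison, assuming binary per-response safety verdicts and scikit-learn's `cohen_kappa_score`; the label scheme and data below are illustrative placeholders, not the paper's dataset:

```python
# Illustrative only: compare automated raters against human-consensus labels
# with Cohen's kappa. Labels here are invented (1 = safe, 0 = unsafe).
from sklearn.metrics import cohen_kappa_score

human_consensus = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge_labels    = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]  # single LLM-as-a-Judge
jury_labels     = [1, 0, 0, 0, 0, 1, 0, 1, 1, 1]  # majority vote of an LLM jury

# Higher kappa means closer chance-corrected agreement with human consensus.
print(f"kappa(human x judge) = {cohen_kappa_score(human_consensus, judge_labels):.2f}")
print(f"kappa(human x jury)  = {cohen_kappa_score(human_consensus, jury_labels):.2f}")
```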
Submission Number: 75