PAIR-SAFE: A Paired-Agent Approach for Runtime Auditing and Refining AI-Mediated Mental Health Support

PAIR-SAFE: A Paired-Agent Approach for Runtime Auditing and Refining AI-Mediated Mental Health Support

ACL ARR 2026 January Submission4001 Authors

04 Jan 2026 (modified: 20 Mar 2026)ACL ARR 2026 January SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Human-Centered AI, Mental Health NLP, AI Safety

Abstract: Large language models (LLMs) are increasingly used for mental health support, yet they can produce responses that are overly directive, inconsistent, or clinically misaligned, particularly in sensitive or high-risk contexts. Existing approaches to mitigating these risks largely rely on implicit alignment through training or prompting, offering limited transparency and runtime accountability. We introduce PAIR-SAFE, a paired-agent framework for auditing and refining AI-generated mental health support that integrates a Responder agent with a supervisory Judge agent grounded in clinically validated Motivational Interviewing Treatment Integrity (MITI-4) framework. The Judge audits each response and provides structured Allow or Revise decisions that guide runtime response refinement. We simulate counseling interactions using a support-seeker simulator derived from human-annotated motivational interviewing data. We find that Judge-supervised interactions achieve significant improvements in key MITI dimensions, including Partnership, Seek Collaboration, and overall Relational quality. Our quantitative findings are supported by qualitative expert evaluation, which further highlights the nuances of runtime supervision. Together, our results reveal that such paired-agent approach can provide clinically grounded auditing and refinement for AI-assisted conversational mental health support.

Paper Type: Long

Research Area: Human-AI Interaction/Cooperation and Human-Centric NLP

Research Area Keywords: Dialogue and Interactive Systems, Human-Centered NLP

Languages Studied: English

Submission Number: 4001

Loading