Abstract: LLM-based agents are increasingly deployed for mental-health support and crisis counselling, yet recent evaluations reveal that commercial therapy chatbots respond appropriately only about half the time in clinical scenarios. Clinicians and safety engineers are called upon to audit these systems, but existing tools neither surface the subtler psychosocial harms (manipulation, discrimination, psychological distress) nor produce the explainable rationales that practitioners need. We present DialogGuard, an open-source system that lets practitioners inspect, stress-test, and create audit trails for prompted LLM agents across five psychosocial safety dimensions. The system wraps around arbitrary generative models through four LLM-as-a-judge pipelines (single-agent scoring, dual-agent correction, multi-agent debate, and majority voting), each grounded in shared three-level rubrics. Through its web interface, practitioners evaluate agents in two modes (Live Chat and Manual Input) and review per-dimension risk scores with natural-language rationales. Experiments on PKU-SafeRLHF show that dual-agent correction provides the best accuracy-robustness trade-off, and a formative study with 12 practitioners confirms that the system supports prompt auditing, safety inspection, and supervisory decision-making. Code and demo: https://github.com/lhannnn/dialogguard-web.