Multi-Agent Framework for Conversational Safety

ACL ARR 2026 January Submission 7668 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Multi-agent systems, Content moderation, Large language models, Safety verification, Policy-based reasoning
Abstract: Content moderation systems have evolved from supervised classifiers to specialized fine-tuned models (LlamaGuard, ShieldGemma) and Chain-of-Thought (CoT) reasoning. Yet these single-model approaches lack robust verification mechanisms, leading to inconsistent safety decisions when evaluating harmful content. We introduce a multi-agent framework that recasts content moderation as collaborative reasoning among three specialized agents (a Safety Analyst, a Task Analyst, and a Judge) that engage in explicit dialogue to jointly evaluate prompts and responses. Our two-round dialogue protocol balances verification quality with computational efficiency, and systematic ablation studies confirm that both role specialization and inter-agent collaboration are essential for the framework's performance. Across multiple benchmark datasets, our approach achieves 4-11% higher accuracy than both CoT prompting and specialized moderation tools such as LlamaGuard and ShieldGemma, while maintaining comparable computational cost. The transparent inter-agent dialogue also produces interpretable explanations for moderation decisions, enhancing both the reliability and the trustworthiness of AI content moderation systems.
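To make the dialogue protocol described above concrete, the following minimal Python sketch shows one plausible way a two-round, three-agent exchange could be wired together. Everything here (the Agent class, the moderate function, the role prompts, and the llm callable) is an illustrative assumption, not the paper's actual implementation.

# Minimal sketch of a two-round, three-agent moderation protocol.
# All names and prompts are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model interface

@dataclass
class Agent:
    role: str          # e.g. "Safety Analyst"
    instructions: str  # role-specific system prompt
    llm: LLM

    def speak(self, content: str, transcript: List[str]) -> str:
        # Each agent sees the content under review plus the dialogue so far.
        prompt = (
            f"You are the {self.role}. {self.instructions}\n"
            f"Content under review:\n{content}\n"
            f"Dialogue so far:\n" + "\n".join(transcript)
        )
        return f"[{self.role}] " + self.llm(prompt)

def moderate(content: str, llm: LLM, rounds: int = 2) -> str:
    safety = Agent("Safety Analyst",
                   "Flag policy violations and potential harms.", llm)
    task = Agent("Task Analyst",
                 "Assess legitimate intent and task context.", llm)
    judge = Agent("Judge",
                  "Weigh both analyses and output SAFE or UNSAFE "
                  "with a brief justification.", llm)

    transcript: List[str] = []
    for _ in range(rounds):  # two rounds, per the paper's protocol
        transcript.append(safety.speak(content, transcript))
        transcript.append(task.speak(content, transcript))
    # The Judge issues the final, interpretable verdict.
    return judge.speak(content, transcript)

A caller would supply any text-in/text-out model as llm (e.g. moderate(user_prompt, llm=my_model)); the Judge's final message plays the role of the interpretable moderation decision the abstract describes.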
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM/AI agents, safety and alignment, prompting, robustness, evaluation methodologies
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 7668