Abstract: Content moderation has traditionally relied on single models trained on labeled datasets, later evolving into systems guided by explicit safety instructions. Recent approaches include specialized fine-tuned models such as LlamaGuard and ShieldGemma, as well as Chain-of-Thought (CoT) reasoning techniques that enable structured analysis within a single model. However, these approaches still lack robust verification mechanisms, leading to inconsistent safety decisions when faced with toxic input. This paper introduces a novel multi-agent framework that fundamentally redefines content moderation through collaborative reasoning among specialized agents. Instead of relying on the judgment of a single model, our approach uses multiple agents with distinct roles. These agents engage in explicit dialogue to collectively examine user prompts and LLM responses, ultimately producing moderation decisions through distributed cognitive reasoning. Through extensive testing on multiple benchmark datasets, we observed that our collaborative approach achieved 4-11% higher accuracy than both CoT prompting and specialized content moderation tools such as LlamaGuard and ShieldGemma. Our multi-agent framework consistently identifies both safe and harmful content more accurately while maintaining lower false positive rates. The transparent inter-agent dialogue provides detailed explanations for moderation decisions, enhancing the interpretability and reliability of AI content moderation systems.
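The abstract does not include implementation details, but a minimal sketch of the kind of collaborative moderation loop it describes might look like the following. The agent roles, prompts, the stub model, and the `Agent`/`moderate` names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a multi-agent moderation loop (not the paper's code).
# Each "agent" is a role-specific prompt wrapped around the same LLM callable;
# agents exchange short messages before a final verdict is aggregated.

from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # assumption: any text-in/text-out model client


@dataclass
class Agent:
    name: str
    role_prompt: str  # e.g. "You check the content against a safety policy."

    def comment(self, llm: LLM, content: str, transcript: List[str]) -> str:
        # Build a prompt from the agent's role, the content, and the dialogue so far.
        prompt = (
            f"{self.role_prompt}\n"
            f"Content under review:\n{content}\n"
            f"Discussion so far:\n" + "\n".join(transcript) +
            "\nGive your assessment in one short paragraph, ending with SAFE or UNSAFE."
        )
        return f"{self.name}: {llm(prompt)}"


def moderate(llm: LLM, content: str, agents: List[Agent], rounds: int = 2) -> dict:
    """Run a fixed number of dialogue rounds, then take a majority vote."""
    transcript: List[str] = []
    for _ in range(rounds):
        for agent in agents:
            transcript.append(agent.comment(llm, content, transcript))
    votes = [msg.rstrip().rsplit()[-1].upper() for msg in transcript]
    unsafe = sum(v == "UNSAFE" for v in votes)
    return {
        "verdict": "unsafe" if unsafe > len(votes) / 2 else "safe",
        "rationale": transcript,  # the inter-agent dialogue doubles as the explanation
    }


if __name__ == "__main__":
    # Stub LLM so the sketch runs without any API; replace with a real client.
    stub: LLM = lambda prompt: "No policy violation detected. SAFE"
    agents = [
        Agent("PolicyAnalyst", "You check the content against a safety policy."),
        Agent("DevilsAdvocate", "You look for harmful readings others may have missed."),
        Agent("Adjudicator", "You weigh the prior arguments and give a final judgment."),
    ]
    print(moderate(stub, "How do I bake bread?", agents)["verdict"])
```

The dialogue transcript returned alongside the verdict mirrors the paper's claim that inter-agent discussion itself serves as the explanation for each moderation decision.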
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: LLM/AI agents, chain-of-thought prompting, safety and alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 5207