Abstract: Content moderation has traditionally relied on single models trained on labeled datasets, later evolving into systems guided by explicit safety instructions. Recent approaches include specialized fine-tuned models such as LlamaGuard and ShieldGemma, as well as Chain-of-Thought (CoT) reasoning techniques that enable structured analysis within a single model. However, these approaches still lack robust verification mechanisms, leading to inconsistent safety decisions when faced with toxic input. This paper introduces a novel multi-agent framework that fundamentally redefines content moderation through collaborative reasoning among specialized agents. Instead of relying on the judgment of a single model, our approach uses multiple agents with distinct roles. These agents engage in explicit dialogue to collectively examine user prompts and LLM responses, ultimately producing moderation decisions through distributed cognitive reasoning. Through extensive testing on multiple benchmark datasets, we observed that our collaborative approach achieved 4-11% higher accuracy than both CoT prompting and specialized content moderation tools such as LlamaGuard and ShieldGemma. Our multi-agent framework consistently identifies both safe and harmful content more accurately while maintaining lower false positive rates. The transparent inter-agent dialogue provides detailed explanations for moderation decisions, enhancing the interpretability and reliability of AI content moderation systems.
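The abstract does not include implementation details, but a minimal sketch of the kind of collaborative moderation loop it describes might look like the following. The agent roles, prompts, the stub model, and the `Agent`/`moderate` names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a multi-agent moderation loop (not the paper's code).
# Each "agent" is a role-specific prompt wrapped around the same LLM callable;
# agents exchange short messages before a final verdict is aggregated.

from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # assumption: any text-in/text-out model client


@dataclass
class Agent:
    name: str
    role_prompt: str  # e.g. "You check the content against a safety policy."

    def comment(self, llm: LLM, content: str, transcript: List[str]) -> str:
        # Build a prompt from the agent's role, the content, and the dialogue so far.
        prompt = (
            f"{self.role_prompt}\n"
            f"Content under review:\n{content}\n"
            f"Discussion so far:\n" + "\n".join(transcript) +
            "\nGive your assessment in one short paragraph, ending with SAFE or UNSAFE."
        )
        return f"{self.name}: {llm(prompt)}"


def moderate(llm: LLM, content: str, agents: List[Agent], rounds: int = 2) -> dict:
    """Run a fixed number of dialogue rounds, then take a majority vote."""
    transcript: List[str] = []
    for _ in range(rounds):
        for agent in agents:
            transcript.append(agent.comment(llm, content, transcript))
    votes = [msg.rstrip().rsplit()[-1].upper() for msg in transcript]
    unsafe = sum(v == "UNSAFE" for v in votes)
    return {
        "verdict": "unsafe" if unsafe > len(votes) / 2 else "safe",
        "rationale": transcript,  # the inter-agent dialogue doubles as the explanation
    }


if __name__ == "__main__":
    # Stub LLM so the sketch runs without any API; replace with a real client.
    stub: LLM = lambda prompt: "No policy violation detected. SAFE"
    agents = [
        Agent("PolicyAnalyst", "You check the content against a safety policy."),
        Agent("DevilsAdvocate", "You look for harmful readings others may have missed."),
        Agent("Adjudicator", "You weigh the prior arguments and give a final judgment."),
    ]
    print(moderate(stub, "How do I bake bread?", agents)["verdict"])
```

The dialogue transcript returned alongside the verdict mirrors the paper's claim that inter-agent discussion itself serves as the explanation for each moderation decision.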
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: LLM/AI agents, chain-of-thought prompting, safety and alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 5207