Keywords: LLM Safety, Content Moderation, LLM, Benchmark
Abstract: Content moderation is crucial for maintaining safe online environments, yet the growing reliance on Large Language Models (LLMs) for this task is limited by inadequate evaluation methods. Existing benchmarks for content moderation suffer from a fundamental weakness: they are built upon mutually exclusive and static rules, and thus fail to capture the complex and dynamic nature of real-world violations. To address this, we introduce the Generalized Moderation Policy (GMP) Benchmark, the first framework to systematically evaluate model generalization to multifaceted and evolving policies. GMP features two core tasks: (1) \textbf{Identifying Complex Violations}, which requires models to identify all co-occurring violation types within a single piece of content; and (2) \textbf{Adapting to Dynamic Rules}, which assesses a model's ability to reason on the fly over novel, context-specific policies. Our comprehensive evaluation of over 20 state-of-the-art LLMs on the GMP benchmark reveals two critical deficiencies: (1) even top-tier models struggle to comprehensively identify all co-occurring harms, showing a particular weakness in detecting long-tail safety risks; and (2) their performance fluctuates significantly when faced with dynamic rules, indicating a critical gap in true policy adherence. These findings highlight the urgent need for more robust and generalizable AI moderation systems.
Primary Area: datasets and benchmarks
Submission Number: 16265