Keywords: LLM Safety, Content Moderation, LLM, Benchmark
Abstract: Content moderation is crucial for maintaining safe online environments, yet the growing reliance on Large Language Models (LLMs) for this task is limited by inadequate evaluation methods. Existing benchmarks for content moderation suffer from a fundamental weakness: they are built upon mutually exclusive and static rules, and thus fail to capture the complex and dynamic nature of real-world violations. To address this, we introduce the Generalized Moderation Policy (GMP) Benchmark, the first framework to systematically evaluate model generalization to multifaceted and evolving policies. GMP features two core tasks: (1) \textbf{Identifying Complex Violations}, which requires models to identify all co-occurring violation types within a single piece of content; and (2) \textbf{Adapting to Dynamic Rules}, which assesses a model's ability to reason on the fly over novel, context-specific policies. Our comprehensive evaluation of over 20 state-of-the-art LLMs on the GMP benchmark reveals two critical deficiencies: (1) even top-tier models struggle to comprehensively identify all co-occurring harms, showing a particular weakness in detecting long-tail safety risks; and (2) their performance fluctuates significantly when faced with dynamic rules, indicating a critical gap in true policy adherence. These findings highlight the urgent need for more robust and generalizable AI moderation systems.
Primary Area: datasets and benchmarks
Submission Number: 16265