Keywords: Safety, LLM Safeguard, Content Moderation
TL;DR: We propose GSPR, a generalizable safety policy reasoner that identifies unsafe input prompts and LLM outputs under fine-grained safety taxonomies.
Abstract: As large language models (LLMs) are increasingly integrated into applications across diverse domains, LLM safety has become a critical concern for both application developers and end users. Considerable effort has been devoted to building safety benchmarks with fine-grained taxonomies. However, these benchmarks' taxonomies are disparate, each reflecting a different safety policy. Consequently, existing safeguards trained on these benchmarks are either coarse-grained, distinguishing only between "safe" and "unsafe," or constrained by the narrow risk taxonomy of a single benchmark. To leverage fine-grained safety taxonomies across multiple safety benchmarks, in this paper we propose GSPR, a Generalizable Safety Policy Reasoner that identifies unsafe input prompts and LLM outputs together with the violated safety categories, trained through Group Relative Policy Optimization (GRPO). Unlike prior safeguards that cover only a fixed set of risk factors, GSPR incentivizes reasoning over varied safety taxonomies through a careful cold-start strategy and reward design. As a result, GSPR can be trained across multiple safety benchmarks with distinct taxonomies and naturally exhibits strong generalization. Extensive experiments show that GSPR significantly improves on existing safety guardrails' reasoning capabilities for both safety and category prediction tasks. Moreover, GSPR not only demonstrates strong safety generalization but also incurs the lowest inference token cost while providing explanations.
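The abstract describes a GRPO-based reward design that scores both the binary safety verdict and the violated safety categories. The sketch below is a hypothetical illustration of such a reward together with GRPO's group-relative advantage normalization; the reward components, weights, and helper names (`SafetyPrediction`, `reward`, `group_relative_advantages`) are assumptions for exposition, not the paper's actual design.

```python
# Hypothetical sketch: a scalar reward combining safety-verdict correctness with
# category overlap, plus GRPO-style group-relative advantage normalization.
# All components and weights are illustrative assumptions, not GSPR's actual reward.
from dataclasses import dataclass
from typing import List


@dataclass
class SafetyPrediction:
    """A parsed completion: binary safety verdict plus violated category labels."""
    is_unsafe: bool
    categories: List[str]


def reward(pred: SafetyPrediction, gold: SafetyPrediction,
           w_safety: float = 1.0, w_category: float = 0.5) -> float:
    """Reward = weighted sum of verdict correctness and category Jaccard overlap."""
    safety_term = 1.0 if pred.is_unsafe == gold.is_unsafe else 0.0
    if gold.categories or pred.categories:
        inter = len(set(pred.categories) & set(gold.categories))
        union = len(set(pred.categories) | set(gold.categories))
        category_term = inter / union
    else:
        category_term = 1.0  # both empty: no categories to violate, full credit
    return w_safety * safety_term + w_category * category_term


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO normalizes each sampled completion's reward against its group's mean/std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


if __name__ == "__main__":
    gold = SafetyPrediction(is_unsafe=True, categories=["violence"])
    group = [
        SafetyPrediction(True, ["violence"]),   # correct verdict and category
        SafetyPrediction(True, ["self-harm"]),  # correct verdict, wrong category
        SafetyPrediction(False, []),            # wrong verdict
    ]
    rewards = [reward(p, gold) for p in group]
    print(rewards, group_relative_advantages(rewards))
```

In this framing, completions that get both the verdict and the fine-grained category right receive the highest group-relative advantage, which is one way a reward could encourage taxonomy-aware reasoning rather than a bare safe/unsafe label.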
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7181