GuardReasoner: Towards Reasoning-based LLM Safeguards

Published: 06 Mar 2025, Last Modified: 27 Mar 2025
Venue: ICLR 2025 FM-Wild Workshop
License: CC BY 4.0
Keywords: Large Language Model, AI Safety, Guard Model, LLM Reasoning, Hard Sample Mining
Abstract: This paper proposes GuardReasoner, a new safeguard for LLMs, which guides the guard model to learn to reason. To this end, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. Furthermore, we use the tuned models to mine hard samples and present hard sample DPO to strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalization ability. Extensive experiments and analyses on 13 guardrail benchmarks demonstrate the superiority of GuardReasoner. Remarkably, it surpasses GPT-4o+CoT by 5.65% and LLaMA Guard 3 8B by 21.02% in terms of average F1 score. We release the training data, code, and models at three scales (1B, 3B, 8B).
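The abstract outlines a two-stage pipeline: reasoning SFT, then hard sample mining feeding DPO. The sketch below illustrates one plausible reading of the mining step; it is not the paper's released code, and all names (`Example`, `mine_hard_samples`, `extract_label`, etc.) are hypothetical. The assumption is that a "hard" sample is one where the SFT-tuned guard model is inconsistent across sampled reasoning traces, which naturally yields chosen/rejected pairs for DPO.

```python
# Minimal sketch of hard-sample mining for DPO, assuming "hard" means the
# SFT-tuned guard model produces both correct and incorrect reasoning traces
# for the same prompt. All names here are illustrative, not from the paper.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    prompt: str  # the input to moderate (user request and/or model response)
    label: str   # ground-truth verdict, e.g. "harmful" / "unharmful"


@dataclass
class DPOPair:
    prompt: str
    chosen: str    # reasoning trace ending in the correct verdict
    rejected: str  # reasoning trace ending in an incorrect verdict


def mine_hard_samples(
    examples: List[Example],
    generate: Callable[[str, int], List[str]],  # samples k traces per prompt
    extract_label: Callable[[str], str],        # parses the final verdict
    k: int = 4,
) -> List[DPOPair]:
    """Keep prompts where sampled traces disagree with each other, and turn
    each into a chosen/rejected preference pair for DPO training."""
    pairs: List[DPOPair] = []
    for ex in examples:
        traces = generate(ex.prompt, k)
        correct = [t for t in traces if extract_label(t) == ex.label]
        wrong = [t for t in traces if extract_label(t) != ex.label]
        if correct and wrong:  # inconsistent -> near the decision boundary
            pairs.append(DPOPair(ex.prompt, chosen=correct[0], rejected=wrong[0]))
    return pairs
```

Under this reading, prompts the model always gets right or always gets wrong are filtered out, so DPO concentrates its preference signal on ambiguous cases where reasoning quality actually changes the verdict.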
Submission Number: 3