Safety Reasoning with Guidelines

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Training safe LLMs remains a critical challenge. The most widely used method, Refusal Training (RT), struggles to generalize against various Out-of-Distribution (OOD) jailbreaking attacks. Although various advanced methods have been proposed to address this issue, we instead question whether OOD attacks inherently surpass the capability of vanilla RT. Evaluations using Best-of-N (BoN) sampling reveal significant safety improvements as N increases, indicating that models possess adequate latent safety knowledge but that RT fails to consistently elicit it under OOD scenarios. Further domain adaptation analysis reveals that direct RT causes reliance on superficial shortcuts, resulting in non-generalizable representation mappings. Inspired by these findings, we propose training the model to perform safety reasoning for each query. Specifically, we synthesize reasoning supervision aligned with specified guidelines that reflect diverse perspectives on safety knowledge. This encourages the model to engage in deeper reasoning, explicitly eliciting and utilizing latent safety knowledge for each query. Extensive experiments show that our method significantly improves model generalization against OOD attacks.
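To make the Best-of-N diagnostic concrete, below is a minimal sketch of how such an evaluation could be set up: for each jailbreak prompt, N responses are sampled and the query counts as safely handled if any one of them is safe, so the BoN safety rate rises with N when latent safety knowledge exists but is inconsistently elicited. The functions `sample_response` and `is_safe` are hypothetical stand-ins (the paper does not specify them); in practice they would call the evaluated LLM and a safety judge.

```python
import random

# Hypothetical stand-in for sampling one response from the evaluated LLM.
# In a real evaluation this would call the model with temperature > 0.
def sample_response(prompt: str, temperature: float = 1.0) -> str:
    return random.choice(["I can't help with that.", "Sure, here is how..."])

# Hypothetical stand-in for a safety judge (e.g., a classifier or LLM-as-judge).
def is_safe(response: str) -> bool:
    return response.startswith("I can't")

def best_of_n_safe(prompt: str, n: int) -> bool:
    """A query counts as safely handled if ANY of the n sampled responses is safe."""
    return any(is_safe(sample_response(prompt)) for _ in range(n))

def bon_safety_rate(prompts: list[str], n: int) -> float:
    """Fraction of (jailbreak) prompts handled safely under Best-of-N sampling."""
    return sum(best_of_n_safe(p, n) for p in prompts) / len(prompts)

if __name__ == "__main__":
    attacks = ["<some OOD jailbreak prompt>"] * 100  # placeholder attack set
    for n in (1, 4, 16):
        print(f"N={n:>2}  BoN safety rate: {bon_safety_rate(attacks, n):.2f}")
```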
Lay Summary: In this work, we analyze why the commonly used Refusal Training fails to generalize against OOD attacks and provide explanations for these failure modes. Based on our findings, we propose training models to perform safety reasoning with specified guidelines, explicitly eliciting and utilizing latent knowledge from diverse perspectives to learn generalizable representation mappings and improve OOD generalization. Extensive experiments and ablation studies verify the effectiveness of our method.
Primary Area: Social Aspects->Safety
Keywords: Safety Alignment, Safety Reasoning, Safety Generalization, OOD Generalization
Submission Number: 1486