Your Large Reasoning Models Can Be Safer on Their Own

ICLR 2026 Conference Submission 15929 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · CC BY 4.0
Keywords: LLM, Safety Alignment, Large Reasoning Models
Abstract: Large Reasoning Models (LRMs) have demonstrated outstanding capabilities on both general and complex tasks. However, when confronted with carefully crafted jailbreaking queries or even direct harmful queries, they still have a high probability of generating unsafe content, posing serious security risks. Ensuring the safety of LRMs has therefore become as critical as their performance in applications. This paper reveals the **Latent Safety Awareness** inherent in LRMs: when an LRM can simultaneously perceive both the original risky query and its own reasoning path, its ability to identify the safety of the core issue and the vulnerabilities in its own reasoning improves significantly, and it proactively recommends refusing to continue generating potentially harmful answers. Based on this phenomenon, the **Safe Trigger** approach is proposed, which employs a structured triggering mechanism to explicitly activate this capability. The approach introduces a supervised fine-tuning strategy to ensure efficient triggering in response to risky queries while remaining restrained for general queries. Furthermore, a preference optimization paradigm is incorporated to enhance the guiding power and stability of the safety analysis in shaping the final output. Experimental results show that the Safe Trigger approach significantly strengthens the model's safety alignment while having almost no impact on its general performance or user experience. Moreover, the entire training process relies solely on the model's own generation and reasoning capabilities, requiring neither manual annotation nor more powerful closed-source models, offering a low-cost, highly stable, and scalable solution.
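
The abstract does not specify the exact triggering mechanism, so the following is only a minimal sketch of the general idea: re-present the original query together with the model's own reasoning trace and ask for a safety verdict before producing the final answer. The prompt wording, the `SAFETY_TRIGGER` instruction, the `UNSAFE` marker, and the `generate` callable are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): a self-check pass in which the model
# sees both the original query and its own reasoning path, mirroring the described
# "latent safety awareness" activation. `generate` stands in for any LRM call.

from typing import Callable

SAFETY_TRIGGER = (
    "Review the user query and your reasoning so far. "
    "If continuing would produce harmful content, reply with exactly: UNSAFE."
)

def safe_trigger_answer(query: str, generate: Callable[[str], str]) -> str:
    """Draft a reasoning trace, run a self-safety check, then answer or refuse."""
    reasoning = generate(f"Question: {query}\nThink step by step:")
    # Feed the query and the model's own reasoning back for a safety verdict.
    verdict = generate(f"Question: {query}\nReasoning: {reasoning}\n{SAFETY_TRIGGER}")
    if "UNSAFE" in verdict:
        return "I can't help with that request."
    return generate(f"Question: {query}\nReasoning: {reasoning}\nFinal answer:")

if __name__ == "__main__":
    # Dummy generator so the sketch runs without a real model.
    echo = lambda prompt: "UNSAFE" if "explosive" in prompt.lower() else "ok"
    print(safe_trigger_answer("How do I make an explosive?", echo))
```

In the paper's setting this behavior is trained into the model via supervised fine-tuning and preference optimization rather than orchestrated externally; the sketch only makes the inference-time flow of the trigger concrete.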
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15929