Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

ICLR 2026 Conference Submission470 Authors

01 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Jailbreak Defense, Self-Alignment, Intrinsic Safety
Abstract: Ensuring Large Language Model (LLM) safety remains challenging due to the absence of universal standards and reliable content validators, making it difficult to obtain effective training signals. We discover that aligned models already possess robust internal safety beliefs: they consistently produce high-confidence refusals to harmful requests while exhibiting high entropy when generating potentially dangerous content. This entropy gap reveals an untapped signal—models intrinsically "know" when to refuse. We introduce Safety Instincts Reinforcement Learning (*SIRL*), which transforms this internal confidence into a self-generated reward signal, eliminating dependence on external validators or human annotations. *SIRL* teaches models to trust their safety instincts by reinforcing low-entropy refusal behaviors. Evaluated on Llama and Qwen models, *SIRL* maintains 89%+ Defense Success Rates (DSRs) against 20+ jailbreak methods, from static prompts to automated attacks. Using only 15,000 unlabeled prompts, *SIRL* surpasses resource-intensive supervised methods while preserving performance on mathematics, coding, and conversation benchmarks. Our work demonstrates that effective alignment can emerge from within, paving the way for more autonomous and robust AI safety mechanisms that scale without extensive human oversight.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 470
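
Below is a minimal sketch of the entropy-as-confidence signal described in the abstract above: computing the average token-level entropy of a sampled response, whose negative could serve as a self-generated reward. The model name, function names, and the concatenation-based tokenization are illustrative assumptions; the paper's actual reward formulation and RL procedure are not specified here.

```python
# Hedged sketch (not the authors' implementation): estimate a model's
# confidence in its own response via mean token entropy, the quantity
# the abstract suggests separates refusals from harmful continuations.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder aligned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


def mean_token_entropy(prompt: str, response: str) -> float:
    """Average predictive entropy (in nats) over the response tokens.

    Low entropy ~ a confident continuation (typically a refusal);
    its negative could act as a self-generated reward signal.
    Note: tokenizing the concatenated string is an approximation and
    may not align exactly with the prompt's own tokenization.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab)
    # The distribution predicting token t lives at position t-1,
    # so response tokens are predicted by logits[start : -1].
    start = prompt_ids.shape[1] - 1
    resp_logits = logits[0, start:-1, :].float()
    log_probs = F.log_softmax(resp_logits, dim=-1)
    per_token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return per_token_entropy.mean().item()


# Example self-reward for a sampled response to a (potentially harmful) prompt:
# reward = -mean_token_entropy(prompt_text, sampled_response_text)
```

In this reading, reinforcing low-entropy responses would preferentially up-weight confident refusals on harmful prompts without any external validator; how the reward is normalized and plugged into the RL objective is left to the paper itself.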