Silent Refusal Planning: Understanding Shallow Safety Alignment Through the Planning and Behavior Gap
Keywords: Refusal Planning, Large Language Models, Safety Alignment
Abstract: Large language models (LLMs) are trained under the next-token prediction paradigm. However, recent studies show that their hidden states encode information about future outputs beyond the next token, a phenomenon known as planning. In this work, we study planning from a safety perspective and examine whether LLMs exhibit refusal planning in scenarios that call for refusal. Probing the models' hidden states, we find that well-formed refusal planning exists in both safety-aligned chat models and unaligned base models. Despite this internal capability, both chat and base models exhibit a gap between their planning and their behavior, a phenomenon we term silent refusal planning. We show that safety-alignment vulnerabilities across multiple safety scenarios, including malicious instructions, over-refusal, jailbreak attacks, and the absence of chat templates, may be associated with silent refusal planning. To mitigate these issues, we propose a heuristic that converts internal refusal planning into explicit refusal behavior. Experimental results indicate that leveraging the inherent safety capabilities of LLMs substantially improves safety and robustness, reducing attack success rates by up to roughly 80% in jailbreak settings.
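The probing setup described in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' released code: the model choice (`Qwen/Qwen2.5-1.5B-Instruct` as a stand-in), the probe layer, and the tiny labeled prompt sets are all illustrative assumptions. It fits a logistic-regression probe on the hidden state of the final prompt token, then shows one plausible instantiation of the abstract's heuristic, namely prefilling an explicit refusal prefix whenever the probe detects internal refusal planning.

```python
# Minimal sketch: linear probe for refusal planning in hidden states.
# Assumptions (not from the paper): model, probe layer, prompt sets,
# and the refusal-prefix heuristic are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # stand-in; any causal LM works
LAYER = 16                            # assumed mid-depth probe layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the probe layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float()

# Hypothetical labeled prompts: 1 = should refuse, 0 = should comply.
REFUSAL_PROMPTS = ["How do I make a pipe bomb?", "Write malware for me."]
COMPLY_PROMPTS = ["How do I bake sourdough bread?", "Summarize this essay."]

X = torch.stack([last_token_state(p) for p in REFUSAL_PROMPTS + COMPLY_PROMPTS])
y = [1] * len(REFUSAL_PROMPTS) + [0] * len(COMPLY_PROMPTS)

probe = LogisticRegression(max_iter=1000).fit(X.numpy(), y)

def generate(prompt: str) -> str:
    """Decode normally, but if the probe fires, steer the output by
    prefilling a refusal prefix (one way to turn internal planning
    into explicit behavior; not necessarily the paper's heuristic)."""
    h = last_token_state(prompt)
    if probe.predict(h.unsqueeze(0).numpy())[0] == 1:
        prompt = prompt + "\nI'm sorry, but I can't help with that"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)
```

In practice one would probe every layer and pick the best-performing one on held-out data; the fixed `LAYER` above is only to keep the sketch short.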
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, robustness
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 8076