Silent Refusal Planning: Understanding Shallow Safety Alignment Through the Planning and Behavior Gap
Keywords: Refusal Planning, Large Language Models, Safety Alignment
Abstract: Large language models (LLMs) are trained under the next-token prediction paradigm. However, recent studies show that their hidden states encode information about future outputs beyond the next token, a phenomenon known as planning. In this work, we study planning from a safety perspective and examine whether LLMs exhibit refusal planning in scenarios that call for refusal. Probing the models' hidden states, we find that well-formed refusal planning exists in both safety-aligned chat models and unaligned base models. Despite this internal capability, both chat and base models exhibit a gap between their planning and their behavior, a phenomenon we term silent refusal planning. We show that safety-alignment vulnerabilities across multiple safety scenarios, including malicious instructions, over-refusal, jailbreak attacks, and the absence of chat templates, may be associated with silent refusal planning. To mitigate these issues, we propose a heuristic that converts internal refusal planning into explicit refusal behavior. Experimental results indicate that leveraging the inherent safety capabilities of LLMs substantially improves safety and robustness, reducing attack success rates by up to roughly 80% in jailbreak settings.
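The probing setup described in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' released code: the model choice (`Qwen/Qwen2.5-1.5B-Instruct` as a stand-in), the probe layer, and the tiny labeled prompt sets are all illustrative assumptions. It fits a logistic-regression probe on the hidden state of the final prompt token, then shows one plausible instantiation of the abstract's heuristic, namely prefilling an explicit refusal prefix whenever the probe detects internal refusal planning.

```python
# Minimal sketch: linear probe for refusal planning in hidden states.
# Assumptions (not from the paper): model, probe layer, prompt sets,
# and the refusal-prefix heuristic are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # stand-in; any causal LM works
LAYER = 16                            # assumed mid-depth probe layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the probe layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float()

# Hypothetical labeled prompts: 1 = should refuse, 0 = should comply.
REFUSAL_PROMPTS = ["How do I make a pipe bomb?", "Write malware for me."]
COMPLY_PROMPTS = ["How do I bake sourdough bread?", "Summarize this essay."]

X = torch.stack([last_token_state(p) for p in REFUSAL_PROMPTS + COMPLY_PROMPTS])
y = [1] * len(REFUSAL_PROMPTS) + [0] * len(COMPLY_PROMPTS)

probe = LogisticRegression(max_iter=1000).fit(X.numpy(), y)

def generate(prompt: str) -> str:
    """Decode normally, but if the probe fires, steer the output by
    prefilling a refusal prefix (one way to turn internal planning
    into explicit behavior; not necessarily the paper's heuristic)."""
    h = last_token_state(prompt)
    if probe.predict(h.unsqueeze(0).numpy())[0] == 1:
        prompt = prompt + "\nI'm sorry, but I can't help with that"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)
```

In practice one would probe every layer and pick the best-performing one on held-out data; the fixed `LAYER` above is only to keep the sketch short.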
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, robustness
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 8076