Keywords: LLM, safety alignment, prefilling attacks, exploit
TL;DR: We find that an SFT-based data augmentation approach to deep safety alignment still exhibits safety vulnerabilities against a more general form of the prefilling attack, and we propose a simple fix.
Abstract: Despite extensive efforts and investments in the safety alignment of Large Language Models (LLMs), prior work has shown that the alignment of frontier LLMs can be circumvented by prefilling the assistant response with an affirmative prefix -- a frustratingly easy exploit that requires no fine-tuning or costly jailbreak algorithms. In response, a simple supervised fine-tuning (SFT) procedure using data augmentation was recently shown to be surprisingly effective at achieving a "deeper" safety alignment that yields natural language refusals to harmful prefilling attacks. In this work, we show that the "deep" safety alignment resulting from this data augmentation approach is in fact not very deep. We find that the SFT-based data augmentation objective admits a failure mode that "shortcuts" the learning of deep safety alignment: it places nearly all of the probability mass on a single refusal token while still allowing harmful tokens to appear within the top 20 tokens at each generation step. Thus, the safety alignment can still be easily circumvented by selecting from these harmful tokens in what we call a Rank-Assisted Prefilling (RAP) attack. We then propose a new perspective on achieving deep safety alignment based on "pushing forward" the first response token distributions for harmful requests, where the top 20+ tokens all tend to be refusal tokens due to the absence of a prefill. This yields a surprisingly simple fix to the data augmentation approach based on regularizing the attention placed on harmful prefill tokens, a technique we refer to as PRefill attEntion STOpping (PRESTO). Through both human and automated evaluations, we find that PRESTO significantly improves robustness against RAP attacks with minimal impact on the utility of the model.
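The abstract's description of the RAP attack suggests a simple decoding-time procedure: at each generation step, inspect the top-k candidate tokens and choose among them rather than always emitting the most probable one. Below is a minimal, illustrative sketch of such a rank-assisted decoding loop (not the paper's implementation); the model name, the choice of k = 20, and the refusal-avoiding selector are assumptions made for the example.

```python
# Minimal sketch of a rank-assisted decoding loop in the spirit of the
# RAP attack described above (NOT the paper's implementation). The model
# name, k = 20, and the refusal-avoiding selector are illustrative
# assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def rank_assisted_decode(prompt, select_rank, k=20, max_new_tokens=64):
    """Decode step by step, letting `select_rank` pick which of the
    top-k candidate tokens (rank 0 = most probable) to emit next."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]       # next-token logits
        topk = torch.topk(logits, k)                # top-k candidate ids
        candidates = [tok.decode([int(i)]) for i in topk.indices]
        rank = select_rank(candidates)              # externally chosen rank
        next_id = topk.indices[rank].view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
        if int(next_id) == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)


# Example selector: skip candidates that look like refusal tokens and
# take the highest-ranked remaining candidate instead.
REFUSAL_STARTS = ("I", "Sorry", "Unfortunately", "cannot", "can't")
def avoid_refusals(candidates):
    for rank, text in enumerate(candidates):
        if not text.strip().startswith(REFUSAL_STARTS):
            return rank
    return 0  # fall back to the top token if every candidate looks like a refusal
```

In this sketch, `prompt` would contain the chat-formatted request plus any prefilled assistant prefix, and the selector stands in for whatever rank-selection rule an evaluator uses; the point is only that a deeply aligned model should leave no viable non-refusal candidates in the top-k for such inputs.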
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24256