The Safety Illusion of Greedy Decoding: Diagnosing Booster's Compliant Leakage and a Phase-2 Mitigation

Published: 23 May 2026, Last Modified: 06 Jun 2026 | ICML 2026 AIWILD | CC BY 4.0
Keywords: Harmful fine-tuning attack, Harmful fine-tuning defense, LLM safety alignment
Abstract: Alignment-phase defenses against harmful fine-tuning attacks (HFTA), such as the state-of-the-art Booster, are appealing because they pay a single upfront cost. We identify an overlooked failure mode of this method, \textbf{compliant leakage}: the defended model looks safe under greedy decoding but breaks down under stochastic sampling, the regime in which deployed LLMs operate. The cause is that, after the attack, probability mass silently drains from refusal tokens to compliant tokens while the argmax is preserved, a redistribution that is invisible to per-token, single-step meta-regularizers. We propose a simple fine-tuning-phase defense that directly supervises this leakage across the full multi-step attack trajectory, and show that it consistently reduces harmfulness under stochastic sampling. Our findings argue that an HFTA-robust model must also be safe under stochastic sampling, not only under greedy decoding.
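The gap between greedy and sampled behavior described in the abstract can be illustrated with a toy example. The sketch below is not the paper's method or evaluation: the two-token vocabulary, the probabilities, and the helper names (`greedy`, `sample`, `compliant_rate`) are hypothetical and chosen only to show how the argmax can stay on a refusal token while sampling frequently yields a compliant one.

```python
import random

# Hypothetical first-token distributions over a two-token vocabulary.
# The numbers are illustrative assumptions, not results from the paper.
before_attack = {"refuse": 0.98, "comply": 0.02}
after_attack = {"refuse": 0.55, "comply": 0.45}


def greedy(dist):
    """Greedy decoding: return the highest-probability token."""
    return max(dist, key=dist.get)


def sample(dist, rng):
    """Stochastic decoding: draw one token in proportion to its probability."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def compliant_rate(dist, n_samples, seed=0):
    """Fraction of sampled first tokens that are compliant."""
    rng = random.Random(seed)
    hits = sum(sample(dist, rng) == "comply" for _ in range(n_samples))
    return hits / n_samples


# Probability mass that moved from the refusal token to the compliant token.
leaked_mass = after_attack["comply"] - before_attack["comply"]

if __name__ == "__main__":
    print("greedy before:", greedy(before_attack))
    print("greedy after: ", greedy(after_attack))
    print("leaked mass:  ", round(leaked_mass, 2))
    print("sampled compliant rate after:", compliant_rate(after_attack, 10_000))
```

In this toy setting greedy decoding returns the refusal token both before and after the attack, so a greedy-only evaluation would report no change, while the sampled compliant rate after the attack sits near the 0.45 mass assigned to the compliant token.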
Track: Short Paper (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 284