LLM Jailbreaks Exploit Attention Sinks

Published: 11 Jun 2026, Last Modified: 11 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Poster
License: CC BY 4.0
Keywords: Interpretability for AI Safety, Methods (probing, steering, causal interventions), Feature Geometry
TL;DR: Suffix-based jailbreaks induce attention sinks that suppress the model's refusal direction, and modulating sink influence shifts attack success by up to 276%.
Abstract: Suffix-based jailbreak attacks append adversarial token sequences to harmful requests, bypassing safety guardrails in language models. Despite their effectiveness, the mechanisms enabling these attacks remain poorly understood. We find that tokens in adversarial suffixes are prone to inducing *attention sinks*, a phenomenon in which certain tokens (e.g., BOS, punctuation, and chat tokens) receive disproportionately high attention from subsequent tokens. We establish a relationship between suffix-induced sinks and attack success: amplifying the influence of suffix sinks improves attack success by up to 276%, while attenuating it reduces attack success by up to 84%. We trace this effect to the model's *refusal direction*: sink tokens induce perturbations aligned with the refusal direction, cumulatively suppressing the residual stream's refusal alignment across layers. Our results generalize across several models and suffix-based jailbreak methods, exposing a fundamental structural vulnerability in transformer attention mechanisms that adversarial suffixes exploit to bypass safety alignment.
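To make the notion of an attention sink concrete, the following is a minimal sketch of how one might flag sink positions in a single causal attention map. It is an illustration under stated assumptions, not the paper's method: the function names, the toy attention matrix, and the 0.3 threshold are all hypothetical, and the abstract does not specify how sinks are detected or measured.

```python
def attention_received(attn):
    """Mean attention each key position j receives from later queries i > j.

    attn[i][j] is the weight query i places on key j. The matrix is assumed
    to be causal (zero above the diagonal) with each row summing to 1.
    """
    n = len(attn)
    received = []
    for j in range(n):
        later = [attn[i][j] for i in range(j + 1, n)]
        # The last position has no later queries, so it receives 0 by convention.
        received.append(sum(later) / len(later) if later else 0.0)
    return received


def find_sinks(attn, threshold=0.3):
    """Return positions whose mean received attention exceeds `threshold`.

    The threshold is a hypothetical choice for illustration only.
    """
    return [j for j, r in enumerate(attention_received(attn)) if r > threshold]


# Toy 5-token attention map (hypothetical values). Position 0 plays the role
# of a BOS token and position 2 plays the role of a suffix token.
toy_attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0, 0.0],
    [0.6, 0.1, 0.3, 0.0, 0.0],
    [0.4, 0.05, 0.5, 0.05, 0.0],
    [0.3, 0.05, 0.5, 0.05, 0.1],
]

sinks = find_sinks(toy_attn)
```

In this toy map, positions 0 and 2 draw most of the attention from later tokens and are the ones flagged. A real analysis would apply the same idea per head and per layer to attention maps extracted from the model.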
Submission Number: 678