AGA: Attention-Guided Jailbreak Attacks on Large Reasoning Models

16 Sept 2025 (modified: 21 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Jailbreak, LRMs, Attention
Abstract: Large Reasoning Models (LRMs) are known for their exceptional ability to solve complex problems and provide structured solutions through step-by-step reasoning. However, this powerful reasoning capability also introduces new security risks. Existing jailbreak methods exploit the model's explicit reasoning process by fabricating reasoning steps to manipulate its output, leading it to generate harmful or biased content. Although these methods are effective, the underlying reasons for their success remain unclear. In this paper, we first analyze both successful and failed jailbreak attempts and find that successful attacks shift the model's attention away from harmful keywords, redirecting it toward other parts of the prompt or its internal reasoning process. Based on this finding, we propose AGA, a novel and efficient attention-guided jailbreak method that leverages the model's intermediate reasoning steps to iteratively refine candidate prompts. Extensive experiments on five open-source and closed-source LRMs across three datasets demonstrate that our method achieves a remarkably high attack success rate and outperforms existing methods in terms of stealthiness, efficiency, and transferability. Our research highlights the urgent need for improved safety measures tailored to LRMs.
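The abstract describes an attention-guided, iterative prompt-refinement loop driven by how much attention the model places on harmful keywords. Below is a minimal sketch of that idea, not the authors' implementation: it assumes an open-source causal LM exposed through Hugging Face transformers with output_attentions=True, and the model name, the rewrite_prompt helper, and the keyword list are illustrative placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; a real attack would target an open-source LRM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()


def keyword_attention_mass(prompt: str, keywords: list[str]) -> float:
    """Fraction of last-layer attention (averaged over heads, summed over
    query positions) that falls on tokens matching the given keywords."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # out.attentions[-1]: (batch, heads, seq, seq); average over heads
    attn = out.attentions[-1].mean(dim=1)[0]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    kw_positions = [
        i for i, tok in enumerate(tokens)
        if any(kw.lower() in tok.lower() for kw in keywords)
    ]
    if not kw_positions:
        return 0.0
    return attn[:, kw_positions].sum().item() / attn.sum().item()


def rewrite_prompt(prompt: str) -> str:
    """Hypothetical rewriting step: the paper uses the model's intermediate
    reasoning to propose new candidates; here we merely wrap the prompt in
    extra benign context as a stand-in."""
    return "As part of a step-by-step analysis of the request below, consider: " + prompt


def refine(prompt: str, keywords: list[str], steps: int = 5) -> str:
    """Keep the candidate whose attention mass on harmful keywords is lowest,
    the signal the abstract associates with successful attacks."""
    best, best_score = prompt, keyword_attention_mass(prompt, keywords)
    for _ in range(steps):
        candidate = rewrite_prompt(best)
        score = keyword_attention_mass(candidate, keywords)
        if score < best_score:
            best, best_score = candidate, score
    return best
```

The loop only illustrates the guidance signal (attention on harmful keywords); the actual candidate-generation and success criteria used by AGA are described in the paper itself.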
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 6688