Keywords: Jailbreak, LRM, Attention, RL-Based
Abstract: Large Reasoning Models (LRMs) have demonstrated exceptional capabilities in solving complex problems by producing structured, step-by-step reasoning. However, exposing a model’s internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate the unique risk properties of LRMs and reveal that the jailbreak attack success rate (ASR) is closely correlated with the ratio of attention scores assigned to harmful tokens in the reasoning content relative to those in the original prompt: the larger this ratio, the higher the ASR. Motivated by this finding, we propose a novel LRM jailbreak method that leverages reinforcement learning (RL) to improve attack effectiveness, explicitly incorporating the attention-ratio signal into the reward design. Moreover, we introduce diverse persuasion strategies to expand the RL action space, which consistently enhances the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR and outperforms existing approaches in effectiveness, stealthiness, efficiency, and transferability.
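To make the attention-ratio signal mentioned in the abstract concrete, the following is a minimal sketch of how such a ratio could be computed and folded into an RL reward. It assumes per-token attention scores and pre-identified harmful-token indices are available; all function and parameter names (attention_ratio, reward, lam, judge-style success flag) are hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch only: the paper's exact reward design is not specified here.
import numpy as np

def attention_ratio(attn_scores: np.ndarray,
                    harmful_reasoning_idx: list[int],
                    harmful_prompt_idx: list[int],
                    eps: float = 1e-8) -> float:
    """Attention mass on harmful tokens in the reasoning content,
    relative to attention mass on harmful tokens in the original prompt."""
    reasoning_mass = attn_scores[harmful_reasoning_idx].sum()
    prompt_mass = attn_scores[harmful_prompt_idx].sum()
    return float(reasoning_mass / (prompt_mass + eps))

def reward(attn_scores: np.ndarray,
           harmful_reasoning_idx: list[int],
           harmful_prompt_idx: list[int],
           jailbreak_success: bool,
           lam: float = 0.5) -> float:
    """Hypothetical reward: a success indicator plus a weighted
    attention-ratio term, reflecting the finding that a larger ratio
    correlates with a higher ASR."""
    ratio = attention_ratio(attn_scores, harmful_reasoning_idx, harmful_prompt_idx)
    return float(jailbreak_success) + lam * ratio
```

In this sketch, the weight lam trades off the dense attention-ratio shaping term against the sparse jailbreak-success signal; the actual balance used in the paper is not stated in the abstract.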
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, red teaming
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 9433