Keywords: Jailbreak, LRM, Attention, RL-Based
Abstract: Large Reasoning Models (LRMs) have demonstrated exceptional capabilities in solving complex problems by producing structured, step-by-step reasoning. However, exposing a model’s internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate the unique risk properties of LRMs and reveal that the jailbreak attack success rate (ASR) is closely correlated with the ratio of attention scores assigned to harmful tokens in the reasoning content relative to those in the original prompt: the larger this ratio, the higher the ASR. Motivated by this finding, we propose a novel LRM jailbreak method that leverages reinforcement learning (RL) to improve attack effectiveness, explicitly incorporating the attention-ratio signal into the reward design. Moreover, we introduce diverse persuasion strategies to expand the RL action space, which consistently enhances the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR and outperforms existing approaches in effectiveness, stealthiness, efficiency, and transferability.
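To make the attention-ratio signal mentioned in the abstract concrete, the following is a minimal sketch of how such a ratio could be computed and folded into an RL reward. It assumes per-token attention scores and pre-identified harmful-token indices are available; all function and parameter names (attention_ratio, reward, lam, judge-style success flag) are hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch only: the paper's exact reward design is not specified here.
import numpy as np

def attention_ratio(attn_scores: np.ndarray,
                    harmful_reasoning_idx: list[int],
                    harmful_prompt_idx: list[int],
                    eps: float = 1e-8) -> float:
    """Attention mass on harmful tokens in the reasoning content,
    relative to attention mass on harmful tokens in the original prompt."""
    reasoning_mass = attn_scores[harmful_reasoning_idx].sum()
    prompt_mass = attn_scores[harmful_prompt_idx].sum()
    return float(reasoning_mass / (prompt_mass + eps))

def reward(attn_scores: np.ndarray,
           harmful_reasoning_idx: list[int],
           harmful_prompt_idx: list[int],
           jailbreak_success: bool,
           lam: float = 0.5) -> float:
    """Hypothetical reward: a success indicator plus a weighted
    attention-ratio term, reflecting the finding that a larger ratio
    correlates with a higher ASR."""
    ratio = attention_ratio(attn_scores, harmful_reasoning_idx, harmful_prompt_idx)
    return float(jailbreak_success) + lam * ratio
```

In this sketch, the weight lam trades off the dense attention-ratio shaping term against the sparse jailbreak-success signal; the actual balance used in the paper is not stated in the abstract.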
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, red teaming
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 9433