SeqRL: Sequence-Attentive Reinforcement Learning for LLM Jailbreaking

ICLR 2026 Conference Submission14556 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Jailbreaking, LLM, Safety, Reinforcement Learning
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, underscoring the importance of ensuring their safety and robustness. Recent work has examined jailbreaking attacks that bypass safeguards, but most methods either rely on access to model internals or depend on heuristic prompt designs, limiting their general applicability. Reinforcement learning (RL)-based approaches address some of these issues, yet they often require many interaction steps and overlook vulnerabilities revealed in earlier turns. We propose a novel RL-based jailbreak framework that explicitly analyzes and reweights vulnerabilities exposed in prior steps, enabling effective attacks with fewer queries. We first show that simply leveraging historical information already improves jailbreak success. Building on this insight, we introduce an attention-based reweighting mechanism that adaptively highlights critical vulnerabilities within the interaction history. In comprehensive evaluations on the AdvBench benchmark, our method achieves state-of-the-art performance, attaining higher jailbreak success rates while issuing fewer queries. These findings emphasize the value of incorporating historical vulnerability signals into RL-driven jailbreak strategies, offering a general and effective pathway for advancing adversarial research on LLM safeguards.
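The abstract only names the attention-based reweighting mechanism, so the sketch below is a minimal illustration rather than the authors' implementation: it assumes standard scaled dot-product attention over embeddings of prior interaction turns, and every identifier (HistoryReweighter, d_model, the tensor shapes) is a hypothetical choice for exposition.

```python
# Minimal sketch (assumed, not the paper's code) of attention-based
# reweighting of interaction history: past turns are scored against the
# current attack state, and the weighted summary is fed to the RL policy.
import torch
import torch.nn as nn


class HistoryReweighter(nn.Module):
    """Scores past interaction steps against the current state and
    returns an attention-weighted summary of the history."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # current state -> query
        self.k_proj = nn.Linear(d_model, d_model)  # past steps  -> keys
        self.v_proj = nn.Linear(d_model, d_model)  # past steps  -> values

    def forward(self, state: torch.Tensor, history: torch.Tensor):
        # state:   (batch, d_model)         embedding of the current prompt/state
        # history: (batch, steps, d_model)  embeddings of earlier turns
        q = self.q_proj(state).unsqueeze(1)                    # (batch, 1, d)
        k = self.k_proj(history)                               # (batch, steps, d)
        v = self.v_proj(history)                               # (batch, steps, d)
        scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
        weights = scores.softmax(dim=-1)                       # per-step importance
        context = (weights @ v).squeeze(1)                     # reweighted history
        return context, weights.squeeze(1)


# Usage: fold the reweighted history into the policy's input.
reweighter = HistoryReweighter(d_model=256)
state = torch.randn(4, 256)        # batch of current-state embeddings
history = torch.randn(4, 10, 256)  # embeddings of 10 earlier turns
context, weights = reweighter(state, history)
policy_input = torch.cat([state, context], dim=-1)  # (4, 512)
```

One appeal of this formulation, if it matches the paper's design, is interpretability: the per-step weights indicate which earlier turns the policy treats as the most revealing vulnerabilities.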
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14556