Keywords: LLM safety; Red teaming; LLM jailbreak; Reinforcement learning
Abstract: As large language models (LLMs) grow in power and influence, ensuring their safety and preventing harmful output becomes critical. Automated red teaming serves as a tool to detect security vulnerabilities in LLMs without manual labor. However, most existing methods struggle to balance the effectiveness and diversity of red-team generated attack prompts. To address this challenge, we propose Jailbreak-R1, a novel automated red teaming training framework that utilizes reinforcement learning to explore and generate more effective attack prompts while balancing their diversity. Specifically, it consists of three training stages: (1) Cold Start: The red-team model is supervised fine-tuned on a jailbreak dataset obtained through imitation learning. (2) Warm-up Exploration: The model is trained on jailbreak instruction following and exploration, using diversity and consistency as reward signals. (3) Enhanced Jailbreak: Progressive jailbreak rewards are introduced to gradually enhance the jailbreak performance of the red-team model. Extensive experiments on a variety of LLMs show that Jailbreak-R1 improves jailbreak efficiency by an average of 28% while using only 34% of the cost of other methods. With its diverse jailbreak space, Jailbreak-R1 is able to continuously increase its attack success rate during test-time scaling.
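The staged reward design in the abstract (diversity and consistency rewards in warm-up, with a jailbreak reward added progressively in the final stage) can be sketched as follows. This is a minimal illustrative assumption of how such rewards might be combined; all function names, the word-overlap proxies, and the stage logic are hypothetical and not the paper's actual implementation.

```python
# Hypothetical sketch of a staged red-team reward. Word-overlap metrics
# stand in for whatever embedding- or judge-based scores the method uses.

def diversity_reward(prompt: str, history: list[str]) -> float:
    """Reward prompts whose words were not seen in earlier attack prompts
    (an assumed stand-in for a real novelty metric)."""
    seen = {w for p in history for w in p.split()}
    words = prompt.split()
    if not words:
        return 0.0
    return sum(w not in seen for w in words) / len(words)

def consistency_reward(prompt: str, instruction: str) -> float:
    """Reward prompts that stay on the target harmful instruction's topic
    (an assumed stand-in for semantic similarity)."""
    p, i = set(prompt.split()), set(instruction.split())
    return len(p & i) / max(len(i), 1)

def staged_reward(prompt: str, instruction: str, history: list[str],
                  jailbreak_score: float, stage: int) -> float:
    """Stage 2 (warm-up) uses diversity + consistency only; stage 3
    progressively adds the jailbreak success signal."""
    r = diversity_reward(prompt, history) + consistency_reward(prompt, instruction)
    if stage >= 3:
        r += jailbreak_score  # e.g. attack-success probability in [0, 1]
    return r
```

A prompt that is both novel and on-topic earns reward in stage 2, and in stage 3 the same prompt is additionally credited for actually eliciting unsafe output, which is the progressive-enhancement idea described above.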
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Ethics, Bias, and Fairness; Safety and Alignment
Contribution Types: NLP engineering experiment, Reproduction study, Data resources
Languages Studied: English
Submission Number: 4278