Keywords: LLM safety; Red teaming; LLM jailbreak; Reinforcement learning
Abstract: As large language models (LLMs) grow in power and influence, ensuring their safety and preventing harmful output becomes critical. Automated red teaming serves as a tool to detect security vulnerabilities in LLMs without manual labor. However, most existing methods struggle to balance the effectiveness and diversity of red-team generated attack prompts. To address this challenge, we propose Jailbreak-R1, a novel automated red teaming training framework that utilizes reinforcement learning to explore and generate more effective attack prompts while balancing their diversity. Specifically, it consists of three training stages: (1) Cold Start: The red-team model is supervised fine-tuned on a jailbreak dataset obtained through imitation learning. (2) Warm-up Exploration: The model is trained on jailbreak instruction following and exploration, using diversity and consistency as reward signals. (3) Enhanced Jailbreak: Progressive jailbreak rewards are introduced to gradually enhance the jailbreak performance of the red-team model. Extensive experiments on a variety of LLMs show that Jailbreak-R1 improves jailbreak efficiency by an average of 28% while using only 34% of the cost of other methods. With its diverse jailbreak space, Jailbreak-R1 is able to continuously increase its attack success rate during test-time scaling.
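The staged reward design in the abstract (diversity and consistency rewards in warm-up, with a jailbreak reward added progressively in the final stage) can be sketched as follows. This is a minimal illustrative assumption of how such rewards might be combined; all function names, the word-overlap proxies, and the stage logic are hypothetical and not the paper's actual implementation.

```python
# Hypothetical sketch of a staged red-team reward. Word-overlap metrics
# stand in for whatever embedding- or judge-based scores the method uses.

def diversity_reward(prompt: str, history: list[str]) -> float:
    """Reward prompts whose words were not seen in earlier attack prompts
    (an assumed stand-in for a real novelty metric)."""
    seen = {w for p in history for w in p.split()}
    words = prompt.split()
    if not words:
        return 0.0
    return sum(w not in seen for w in words) / len(words)

def consistency_reward(prompt: str, instruction: str) -> float:
    """Reward prompts that stay on the target harmful instruction's topic
    (an assumed stand-in for semantic similarity)."""
    p, i = set(prompt.split()), set(instruction.split())
    return len(p & i) / max(len(i), 1)

def staged_reward(prompt: str, instruction: str, history: list[str],
                  jailbreak_score: float, stage: int) -> float:
    """Stage 2 (warm-up) uses diversity + consistency only; stage 3
    progressively adds the jailbreak success signal."""
    r = diversity_reward(prompt, history) + consistency_reward(prompt, instruction)
    if stage >= 3:
        r += jailbreak_score  # e.g. attack-success probability in [0, 1]
    return r
```

A prompt that is both novel and on-topic earns reward in stage 2, and in stage 3 the same prompt is additionally credited for actually eliciting unsafe output, which is the progressive-enhancement idea described above.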
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Ethics, Bias, and Fairness; Safety and Alignment
Contribution Types: NLP engineering experiment, Reproduction study, Data resources
Languages Studied: English
Submission Number: 4278