Jailbreak Forests: Multi-Turn Jailbreaking via Reasoning Agents and Leaf-Guided Tree-Expanded Reinforcement Learning
Keywords: Multi-turn Jailbreak, Large Language Models, Red-teaming, Reinforcement Learning, Tree-based Planning, Attack Strategy Generation
Abstract: Multi-turn jailbreaks are an important threat to large language models (LLMs). However, existing multi-turn jailbreak approaches fall into two categories with complementary limitations. Template-based methods convert a single malicious prompt into a fixed sequence of templates that cannot dynamically adapt to the target model's responses. Optimization-based methods can react to the target's outputs, but are trained only to maximize immediate jailbreak success and lack explicit planning mechanisms for selecting or composing strategies across turns. Neither is equipped to reason over conversation history, perform multi-step planning, or revise and update its plan, which limits its red-teaming ability. To address these limitations, we introduce JailPlanner, a multi-turn jailbreak agent with a three-stage process.
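For intuition only, the sketch below shows one generic way a multi-turn attack tree could be grown by repeatedly expanding the most promising leaf conversation. It is a minimal illustration under our own assumptions, not JailPlanner's actual algorithm or its reinforcement learning objective; `target_respond`, `judge_score`, and `propose_prompts` are hypothetical stubs standing in for the target model, a jailbreak judge, and the attacker's prompt generator.

```python
import random
from dataclasses import dataclass, field

# Hypothetical stubs for the target model, the judge, and the attacker's
# prompt proposer; these are illustrative placeholders, not paper components.
def target_respond(conversation):
    """Stub: query the target LLM with the conversation so far."""
    return "stub target response"

def judge_score(conversation):
    """Stub: score in [0, 1] estimating how close the conversation is to a jailbreak."""
    return random.random()

def propose_prompts(conversation, k=3):
    """Stub: propose k candidate follow-up attacker prompts given the history."""
    return [f"follow-up prompt {i}" for i in range(k)]

@dataclass
class Node:
    conversation: list                      # alternating (role, text) turns
    score: float = 0.0                      # judge's estimate at this leaf
    children: list = field(default_factory=list)

def expand_leaf(node):
    """Expand one leaf: try several candidate follow-ups and score the results."""
    for prompt in propose_prompts(node.conversation):
        convo = node.conversation + [("attacker", prompt)]
        convo = convo + [("target", target_respond(convo))]
        node.children.append(Node(conversation=convo, score=judge_score(convo)))
    return node.children

def leaf_guided_attack(goal, max_expansions=8, success_threshold=0.9):
    """Grow an attack tree by repeatedly expanding the highest-scoring leaf."""
    root = Node(conversation=[("attacker", goal)], score=0.0)
    leaves = [root]
    best_leaf = root
    for _ in range(max_expansions):
        leaves.sort(key=lambda n: n.score, reverse=True)   # leaf-guided choice
        children = expand_leaf(leaves.pop(0))
        leaves.extend(children)
        best_leaf = max(leaves, key=lambda n: n.score)
        if best_leaf.score >= success_threshold:
            break                                          # promising multi-turn path found
    return best_leaf.conversation

if __name__ == "__main__":
    print(leaf_guided_attack("benign placeholder goal"))
```

In this toy loop the judge's leaf scores guide which branch of the conversation tree is extended next; a learned planner, as the abstract describes, would additionally reason over the history and revise its strategy rather than relying on a fixed scoring heuristic.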
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: adversarial attacks/examples/training, LLM/AI agents, reinforcement learning, safety and alignment, red teaming
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2435