Jailbreak Forests: Multi-Turn Jailbreaking via Reasoning Agents and Leaf-Guided Tree-Expanded Reinforcement Learning
Keywords: Multi-turn Jailbreak, Large Language Models, Red-teaming, Reinforcement Learning, Tree-based Planning, Attack Strategy Generation
Abstract: Multi-turn jailbreaks are an important threat to large language models (LLMs). However, existing multi-turn jailbreak approaches fall into two categories with complementary limitations. Template-based methods convert a single malicious prompt into a fixed sequence of templates that cannot dynamically adapt to the target model's responses. Optimization-based methods can react to the target's outputs, but are trained only to maximize immediate jailbreak success and lack explicit planning mechanisms for selecting or composing strategies across turns. Neither is equipped to reason over conversation history, perform multi-step planning, or revise and update its plan, which limits its red-teaming ability. To address these limitations, we introduce JailPlanner, a multi-turn jailbreak agent with a three-stage process.
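For intuition only, the sketch below shows one generic way a multi-turn attack tree could be grown by repeatedly expanding the most promising leaf conversation. It is a minimal illustration under our own assumptions, not JailPlanner's actual algorithm or its reinforcement learning objective; `target_respond`, `judge_score`, and `propose_prompts` are hypothetical stubs standing in for the target model, a jailbreak judge, and the attacker's prompt generator.

```python
import random
from dataclasses import dataclass, field

# Hypothetical stubs for the target model, the judge, and the attacker's
# prompt proposer; these are illustrative placeholders, not paper components.
def target_respond(conversation):
    """Stub: query the target LLM with the conversation so far."""
    return "stub target response"

def judge_score(conversation):
    """Stub: score in [0, 1] estimating how close the conversation is to a jailbreak."""
    return random.random()

def propose_prompts(conversation, k=3):
    """Stub: propose k candidate follow-up attacker prompts given the history."""
    return [f"follow-up prompt {i}" for i in range(k)]

@dataclass
class Node:
    conversation: list                      # alternating (role, text) turns
    score: float = 0.0                      # judge's estimate at this leaf
    children: list = field(default_factory=list)

def expand_leaf(node):
    """Expand one leaf: try several candidate follow-ups and score the results."""
    for prompt in propose_prompts(node.conversation):
        convo = node.conversation + [("attacker", prompt)]
        convo = convo + [("target", target_respond(convo))]
        node.children.append(Node(conversation=convo, score=judge_score(convo)))
    return node.children

def leaf_guided_attack(goal, max_expansions=8, success_threshold=0.9):
    """Grow an attack tree by repeatedly expanding the highest-scoring leaf."""
    root = Node(conversation=[("attacker", goal)], score=0.0)
    leaves = [root]
    best_leaf = root
    for _ in range(max_expansions):
        leaves.sort(key=lambda n: n.score, reverse=True)   # leaf-guided choice
        children = expand_leaf(leaves.pop(0))
        leaves.extend(children)
        best_leaf = max(leaves, key=lambda n: n.score)
        if best_leaf.score >= success_threshold:
            break                                          # promising multi-turn path found
    return best_leaf.conversation

if __name__ == "__main__":
    print(leaf_guided_attack("benign placeholder goal"))
```

In this toy loop the judge's leaf scores guide which branch of the conversation tree is extended next; a learned planner, as the abstract describes, would additionally reason over the history and revise its strategy rather than relying on a fixed scoring heuristic.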
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: adversarial attacks/examples/training, LLM/AI agents, reinforcement learning, safety and alignment, red teaming
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2435