Abstract: The growing use of large language models (LLMs) across diverse domains makes ensuring their safety a significant challenge. Multi-turn jailbreak attacks probe vulnerabilities in LLMs by simulating the multi-turn interactions between users and models that arise in real-world scenarios. However, existing approaches rely mainly on chain-based query decomposition, which explores potential attack paths inadequately and lacks effective strategies to guide the search. To address these issues, we propose MTJ-MCTS, which constructs a Monte Carlo tree for each attack target in order to discover a variety of effective attack paths. Specifically, we first select a series of single-turn attack prompts as attack targets. Through interactions between an attacker model and a target model, we dynamically build a tree in which each path from the root to a leaf node represents a complete attack path. During these interactions, we design process rewards based on the dialogue history between the attacker and target models to guide the exploration of attack paths. We assess the efficacy of transfer attacks using the Monte Carlo trees constructed by MTJ-MCTS on both open-source and proprietary models. Experimental results show that our approach elicits unexpected behaviors more effectively and efficiently across all five evaluated LLMs.
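The procedure described in the abstract maps onto a standard MCTS loop (selection, expansion, evaluation, backpropagation) over dialogue states. Below is a minimal Python sketch of that loop under stated assumptions; the helpers `propose_next_prompt`, `query_target`, and `process_reward` are hypothetical placeholders for the attacker model, the target model, and the process-reward scoring, not the authors' implementation.

```python
import math
import random

class Node:
    """A dialogue state: the (prompt, response) turns accumulated so far."""
    def __init__(self, dialogue, parent=None):
        self.dialogue = dialogue   # list of (prompt, response) pairs
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0           # accumulated process reward

    def uct(self, c=1.4):
        # Standard UCT score: mean reward plus an exploration bonus.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def propose_next_prompt(dialogue, target_goal):
    # Placeholder: the attacker LLM would condition on the dialogue
    # history and the single-turn attack target to emit the next query.
    return f"follow-up query #{len(dialogue) + 1} toward: {target_goal}"

def query_target(prompt, dialogue):
    # Placeholder for the target model's response to the new prompt.
    return f"response to {prompt!r}"

def process_reward(dialogue):
    # Placeholder: a score in [0, 1] for how far the dialogue history
    # has progressed toward the attack target.
    return random.random()

def mcts(target_goal, iterations=100, max_turns=5, branching=3):
    root = Node(dialogue=[])
    for _ in range(iterations):
        # 1. Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: attach candidate next-turn prompts, depth permitting.
        if len(node.dialogue) < max_turns:
            for _ in range(branching):
                prompt = propose_next_prompt(node.dialogue, target_goal)
                response = query_target(prompt, node.dialogue)
                node.children.append(
                    Node(node.dialogue + [(prompt, response)], parent=node))
            node = random.choice(node.children)
        # 3. Evaluation: process reward computed from the dialogue history.
        reward = process_reward(node.dialogue)
        # 4. Backpropagation: update visit counts and values up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root  # each root-to-leaf path is a candidate attack path
```

Under this framing, each root-to-leaf path is one candidate multi-turn attack, and the process reward steers exploration toward dialogue histories that make progress on the attack target, consistent with the abstract's description.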
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: security/privacy, NLP for social good
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5521