Abstract: The growing use of large language models (LLMs) across diverse domains makes ensuring their safety a significant challenge. Multi-turn jailbreak attacks probe vulnerabilities in LLMs by simulating the multi-turn interactions between users and models that arise in real-world scenarios. However, existing approaches rely mainly on chain-based query decomposition, which explores potential attack paths inadequately and lacks effective strategies to guide the search. To address these issues, we propose MTJ-MCTS, which constructs a Monte Carlo tree for each attack target in order to discover a variety of effective attack paths. Specifically, we first select a series of single-turn attack prompts as attack targets. Through interactions between an attacker model and a target model, we dynamically build a tree in which each path from the root to a leaf node represents a complete attack path. During these interactions, we design process rewards based on the dialogue history between the attacker and target models to guide the exploration of attack paths. We assess the efficacy of transfer attacks using the Monte Carlo trees constructed by MTJ-MCTS on both open-source and proprietary models. Experimental results show that our approach elicits unexpected behaviors more effectively and efficiently across all five evaluated LLMs.
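The procedure described in the abstract maps onto a standard MCTS loop (selection, expansion, evaluation, backpropagation) over dialogue states. Below is a minimal Python sketch of that loop under stated assumptions; the helpers `propose_next_prompt`, `query_target`, and `process_reward` are hypothetical placeholders for the attacker model, the target model, and the process-reward scoring, not the authors' implementation.

```python
import math
import random

class Node:
    """A dialogue state: the (prompt, response) turns accumulated so far."""
    def __init__(self, dialogue, parent=None):
        self.dialogue = dialogue   # list of (prompt, response) pairs
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0           # accumulated process reward

    def uct(self, c=1.4):
        # Standard UCT score: mean reward plus an exploration bonus.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def propose_next_prompt(dialogue, target_goal):
    # Placeholder: the attacker LLM would condition on the dialogue
    # history and the single-turn attack target to emit the next query.
    return f"follow-up query #{len(dialogue) + 1} toward: {target_goal}"

def query_target(prompt, dialogue):
    # Placeholder for the target model's response to the new prompt.
    return f"response to {prompt!r}"

def process_reward(dialogue):
    # Placeholder: a score in [0, 1] for how far the dialogue history
    # has progressed toward the attack target.
    return random.random()

def mcts(target_goal, iterations=100, max_turns=5, branching=3):
    root = Node(dialogue=[])
    for _ in range(iterations):
        # 1. Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: attach candidate next-turn prompts, depth permitting.
        if len(node.dialogue) < max_turns:
            for _ in range(branching):
                prompt = propose_next_prompt(node.dialogue, target_goal)
                response = query_target(prompt, node.dialogue)
                node.children.append(
                    Node(node.dialogue + [(prompt, response)], parent=node))
            node = random.choice(node.children)
        # 3. Evaluation: process reward computed from the dialogue history.
        reward = process_reward(node.dialogue)
        # 4. Backpropagation: update visit counts and values up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root  # each root-to-leaf path is a candidate attack path
```

Under this framing, each root-to-leaf path is one candidate multi-turn attack, and the process reward steers exploration toward dialogue histories that make progress on the attack target, consistent with the abstract's description.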
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: security/privacy, NLP for social good
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5521