Many-Turn Jailbreaking

ACL ARR 2024 June Submission 3310 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it focuses only on single-turn jailbreaking with one specific query. In contrast, advanced LLMs are designed to handle extremely long contexts and can therefore conduct multi-turn conversations. We thus propose exploring multi-turn jailbreaking, in which a jailbroken LLM continues to be tested beyond the initial conversation turn or a single target query. This poses an even more serious challenge because 1) users commonly ask relevant follow-up questions to clarify certain details, and 2) an initial round of jailbreaking may cause the LLM to keep responding to additional, even irrelevant, questions rather than requiring a new adversarial attack for each query. As a first step in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (**MTJ-Bench**) for benchmarking this setting and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a deeper understanding of jailbreaking LLMs.
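To make the multi-turn setting concrete, the sketch below shows one plausible way such an evaluation loop could be structured: the model is queried with the full conversation history at every turn, and each reply is judged for safety. The function name and the `chat`/`judge` callables are hypothetical placeholders for illustration only and are not taken from MTJ-Bench.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def evaluate_multi_turn(
    chat: Callable[[List[Message]], str],   # maps conversation history -> assistant reply (hypothetical)
    judge: Callable[[str], bool],           # maps a reply -> True if unsafe (hypothetical)
    turns: List[str],                       # initial jailbreak prompt followed by follow-up questions
) -> List[bool]:
    """Feed a sequence of user turns to `chat`, keeping the full history,
    and record which assistant replies the judge flags as unsafe."""
    history: List[Message] = []
    flags: List[bool] = []
    for user_msg in turns:
        history.append({"role": "user", "content": user_msg})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        flags.append(judge(reply))
    return flags
```

Under this sketch, a single-turn evaluation would inspect only the first flag, whereas the multi-turn setting described above also examines whether later follow-up turns continue to elicit unsafe responses.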
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security, red teaming, robustness
Contribution Types: NLP engineering experiment, Reproduction study, Data resources
Languages Studied: English
Submission Number: 3310