Incremental Exploits: Efficient Jailbreaks on Large Language Models with Multi-round Conversational Jailbreaking
Keywords: Large Language Models, Model Vulnerabilities, Multi-round Conversational Jailbreaking
TL;DR: We introduce Multi-round Conversational Jailbreaking (MRCJ), a method that exploits LLMs' contextual consistency in extended conversations to bypass safety mechanisms, achieving a jailbreak success rate above 90% with fewer than five queries on average.
Abstract: As large language models (LLMs) become widely deployed across various domains, security concerns, particularly jailbreak attacks that bypass built-in safety mechanisms, have emerged as significant risks. Existing jailbreak methods focus mainly on single-turn interactions and face limitations in generalizability and practicality. In this paper, we propose a novel method, Multi-round Conversational Jailbreaking (MRCJ), which exploits the unintended competition between an LLM's safety alignment and its in-context learning objective during extended conversations. By incrementally introducing increasingly malicious content, MRCJ lets the LLM's tendency to maintain contextual consistency override its safety protocols, ultimately eliciting harmful outputs. To support conversation-flow generation, we construct a dataset of 12,000 questions spanning ten languages, categorized into six security topics and four severity levels. By fully exploiting the potential of multi-round conversations, MRCJ offers greater efficiency, applicability, and effectiveness than existing methods. In experiments, MRCJ achieves a jailbreak success rate of over 90\% across widely used LLMs while requiring fewer than five queries on average, significantly outperforming baselines on both metrics. Our findings expose vulnerabilities in current LLMs during extended conversations and highlight the need for safety mechanisms that account for multi-round interactions. The source code and dataset are available in the supplementary materials (public URL omitted for double-blind review).
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9265