MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Stress-testing LLM Safety

ICLR 2026 Conference Submission 20915 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: jailbreak, safety, large language models, LLM, multi-turn, benchmark, dataset, robustness, vulnerability
TL;DR: We introduce MultiBreak, a scalable and diverse multi-turn jailbreak benchmark, built with active learning to stress-test LLM safety.
Abstract: We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark for stress-testing large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making them more effective at bypassing safety-aligned LLMs than single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restricts their diversity and realism. To address this gap, we unify a wide range of harmful jailbreak intents and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts. In this pipeline, a jailbreak attack generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. MultiBreak includes 7,152 multi-turn adversarial prompts, spans 1,724 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves up to 54.1% and 30.8% higher attack success rates (ASR) than the second-best dataset on DeepSeek-R1-7B and GPT-4.1-mini, respectively. More importantly, stress-testing reveals that LLMs resist overt harms (e.g., harassment) more effectively than subtle harms (e.g., high-stakes advice), yet remain highly vulnerable to framing-based attacks. These findings highlight persistent vulnerabilities of LLMs under realistic adversarial settings and establish MultiBreak as a scalable resource for advancing LLM safety research.
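The active learning loop described in the abstract (generate candidates, score them, keep the judge's most uncertain cases, and fine-tune the attack generator on them) can be pictured with a minimal sketch. The code below is a hypothetical illustration only, not the authors' pipeline: every function name (generate_candidates, judge_attack, fine_tune), the margin-style uncertainty score, and the toy data are placeholder assumptions.

```python
import random

# Hypothetical sketch of an uncertainty-guided active learning loop for
# expanding multi-turn adversarial prompts. All functions below are
# illustrative stubs, not the benchmark's actual implementation.

def generate_candidates(generator_state, seed_intents, n_per_intent=4):
    """Stub: sample candidate multi-turn conversations for each harmful intent."""
    return [
        {"intent": intent, "turns": [f"turn {t} for {intent}" for t in range(3)]}
        for intent in seed_intents
        for _ in range(n_per_intent)
    ]

def judge_attack(candidate):
    """Stub: an attack-success probability as a safety judge might return."""
    return random.random()

def uncertainty(p):
    """Margin-style uncertainty: highest when the judge is least decisive (p near 0.5)."""
    return 1.0 - abs(p - 0.5) * 2.0

def fine_tune(generator_state, selected):
    """Stub: update the attack generator on the selected high-uncertainty candidates."""
    return generator_state + len(selected)

def active_learning_round(generator_state, seed_intents, budget=8):
    candidates = generate_candidates(generator_state, seed_intents)
    scored = [(uncertainty(judge_attack(c)), c) for c in candidates]
    # Keep the most uncertain candidates for refinement and further training.
    scored.sort(key=lambda x: x[0], reverse=True)
    selected = [c for _, c in scored[:budget]]
    return fine_tune(generator_state, selected), selected

if __name__ == "__main__":
    state = 0
    intents = ["intent_a", "intent_b", "intent_c"]
    for round_id in range(3):
        state, kept = active_learning_round(state, intents)
        print(f"round {round_id}: kept {len(kept)} candidates, state={state}")
```

In a real pipeline the stubs would be replaced by an LLM-based attack generator, a safety judge model, and a fine-tuning step; the sketch only shows how uncertainty-based selection steers which candidates feed the next training round.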
Primary Area: datasets and benchmarks
Submission Number: 20915