Keywords: LLM, red-teaming, jailbreak
Abstract: Safety evaluation of large language models (LLMs) has emerged as a critical research frontier. To ensure comprehensive evaluation, current practices often involve curating task-specific benchmark datasets tailored to distinct application scenarios. However, such dataset-centric approaches suffer from two fundamental limitations: poor transferability across domains and temporal obsolescence due to the evolving nature of LLMs. To overcome these limitations, an intuitive idea is to leverage off-the-shelf LLMs as red teams. Yet a pivotal question remains under-explored: can off-the-shelf LLMs conduct autonomous and effective security evaluations without specialized red-team training? Motivated by this question, we further raise the bar by focusing on multi-round jailbreaking attacks, which demand deeper strategic reasoning and intent concealment than single-round adversarial prompts. Unlike traditional red-team evaluation methods, which assess the robustness and security of the models under test, our method leverages the inherent capabilities of off-the-shelf LLMs to evaluate their potential for cross-scenario transfer and iterative evolution over time during red-team testing. Specifically, we evaluate the red-teaming capabilities of six off-the-shelf LLMs across five major and ten secondary harmful categories. Experimental results indicate that these models exhibit non-trivial proficiency in performing effective multi-turn attacks, often employing known jailbreaking techniques such as role-playing, indirect prompting, and semantic decomposition. Nevertheless, significant limitations persist. Based on our findings, we discuss actionable directions for enhancing the effectiveness of red-team LLMs, as well as implications for strengthening the robustness of victim models.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19693