Keywords: LLM, red-teaming, jailbreak
Abstract: Safety evaluation of large language models (LLMs) has emerged as a critical research frontier. To ensure comprehensive evaluation, current practices often involve curating task-specific benchmark datasets tailored to distinct application scenarios. However, such dataset-centric approaches suffer from two fundamental limitations: poor transferability across domains and temporal obsolescence due to the evolving nature of LLMs. To overcome these limitations, an intuitive idea is to leverage off-the-shelf LLMs as red teams. Yet a pivotal question remains under-explored: can off-the-shelf LLMs conduct autonomous and effective security evaluations without specialized red-team training? Motivated by this question, we further raise the bar by focusing on multi-round jailbreaking attacks, which demand deeper strategic reasoning and intent concealment than single-round adversarial prompts. Unlike traditional red-team evaluation methods, which assess the robustness and security of the models under test, our method leverages the inherent capabilities of off-the-shelf LLMs to evaluate their potential for cross-scenario transfer and iterative evolution over time during red-team testing. Specifically, we evaluate the red-teaming capabilities of six off-the-shelf LLMs across five major and ten secondary harmful categories. Experimental results indicate that these models exhibit non-trivial proficiency in performing effective multi-turn attacks, often employing known jailbreaking techniques such as role-playing, indirect prompting, and semantic decomposition. Nevertheless, significant limitations persist. Based on our findings, we discuss actionable directions for enhancing the effectiveness of red-team LLMs, as well as implications for strengthening the robustness of victim models.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19693