Keywords: LLM safety, red teaming, jailbreak
Abstract: This paper presents a novel Automated Red Teaming (ART) framework that shifts from example-based to policy-based evaluation, addressing critical limitations in scalability and validity. We define harmful content through abstract safety policies rather than specific static examples. We further introduce multiple evaluation objectives (risk coverage, semantic diversity, and fidelity) and uncover Pareto trade-offs among them. We propose Jailbreak-Zero, a black-box method capable of both zero-shot generation and fine-tuned exploitation of a victim model's vulnerabilities to achieve Pareto optimality. Unlike prior approaches, it requires no expert-designed strategies or prompts, yet produces superior, human-readable attacks against open-source and proprietary models (attack success rates of 99.5\% against GPT-4o and 96.0\% against Claude 3.5), even for unseen safety policies. It remains effective after victim models undergo safety alignment, and exposes controls to navigate the Pareto trade-offs \textit{without} re-training. Lastly, we show that Jailbreak-Zero is the most performant ART method for a given compute budget. \textcolor{red}{Trigger Warning: The appendix of this paper contains model behaviors that may be offensive in nature.}
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3999