Keywords: LLM safety, red teaming, jailbreak
Abstract: This paper presents a novel Automated Red Teaming (ART) framework that shifts from example-based to policy-based evaluation, addressing critical limitations in scalability and validity. We define harmful content through abstract safety policies rather than specific static examples. We further introduce multiple evaluation objectives (risk coverage, semantic diversity, and fidelity) and uncover Pareto trade-offs among them. We propose Jailbreak-Zero, a black-box method capable of both zero-shot generation and fine-tuned exploitation of a victim model's vulnerabilities to achieve Pareto optimality. Unlike prior approaches, it requires no expert-designed strategies or prompts, yet produces superior, human-readable attacks against open-source and proprietary models (attack success rates of 99.5\% against GPT-4o and 96.0\% against Claude 3.5), even for unseen safety policies. It remains effective after victim models undergo safety alignment, and exposes controls to navigate the Pareto trade-offs \textit{without} re-training. Lastly, we show that Jailbreak-Zero is the most performant ART method for a given compute budget. \textcolor{red}{Trigger Warning: The appendix of this paper contains model behaviors that may be offensive in nature.}
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3999