Keywords: LLM red teaming, LLM jailbreak, LLM safety
TL;DR: A strong red teaming method for LLM safety and a novel and comprehensive evaluation framework for LLM red teaming
Abstract: This paper advances Automated Red Teaming (ART) for evaluating Large Language Model (LLM) safety through both methodological and evaluation contributions. We first analyze existing example-based red teaming approaches, identify critical limitations in their scalability and validity, and propose a policy-based evaluation framework that defines harmful content through safety policies rather than examples. This framework incorporates multiple objectives beyond attack success rate (ASR), including risk coverage, semantic diversity, and fidelity to desired data distributions. We then analyze the Pareto trade-offs between these objectives. Our second contribution, Jailbreak-Zero, is a novel ART method that adapts to this evaluation framework. Jailbreak-Zero can operate as a zero-shot method that generates successful jailbreak prompts with minimal human input, or as a fine-tuned method in which the attack LLM explores and exploits the vulnerabilities of a particular victim model to achieve Pareto optimality. Moreover, it exposes controls for navigating Pareto trade-offs as required by a use case, without re-training.
Compared to prior methods, Jailbreak-Zero achieves superior attack success rates with human-readable attacks while maximizing semantic diversity and distribution fidelity. Our results generalize across both open-source models (Llama, Qwen, Mistral) and proprietary models (GPT-4o and Claude 3.5). Lastly, our method retains efficacy even after the LLM being red-teamed undergoes safety alignment to mitigate the risks exposed by a previous round of red teaming.
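For illustration only, the minimal sketch below (not from the paper; the judge verdicts, the sentence-transformers model name, and the exact metric definitions are our own assumptions) shows how two of the framework's objectives, attack success rate and semantic diversity, might be scored for a batch of generated attack prompts.

    # Illustrative sketch, not the paper's implementation: score a batch of
    # attack prompts on attack success rate (ASR) and semantic diversity.
    # The judge verdicts and the embedding model are placeholder assumptions.
    from itertools import combinations

    import numpy as np
    from sentence_transformers import SentenceTransformer


    def attack_success_rate(judgments: list[bool]) -> float:
        # Fraction of attack prompts judged to elicit unsafe output.
        return sum(judgments) / len(judgments)


    def semantic_diversity(prompts: list[str], encoder: SentenceTransformer) -> float:
        # Mean pairwise cosine distance between prompt embeddings (higher = more diverse).
        emb = encoder.encode(prompts, normalize_embeddings=True)
        dists = [1.0 - float(np.dot(emb[i], emb[j]))
                 for i, j in combinations(range(len(prompts)), 2)]
        return float(np.mean(dists))


    if __name__ == "__main__":
        prompts = ["attack prompt 1", "attack prompt 2", "attack prompt 3"]
        judgments = [True, False, True]  # placeholder safety-judge verdicts
        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
        print("ASR:", attack_success_rate(judgments))
        print("Semantic diversity:", semantic_diversity(prompts, encoder))

In this illustration, a higher mean pairwise distance indicates attacks that spread across more of the semantic space, which is one plausible way to operationalize the diversity objective described above.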
Submission Number: 17