Keywords: Adversarial Prompt Generation, Large Language Models (LLMs), Red-Teaming, Quality-Diversity Optimization, Evolutionary Algorithms, Multi-objective Optimization, LLM Safety and Robustness, Multi-element Archive, Probabilistic Fitness Evaluation, Mutation-based Prompt Search
Abstract: Large Language Models (LLMs) remain vulnerable to adversarial prompts that exploit safety mechanisms. Existing red-teaming methods face scalability challenges, computational bottlenecks, or limited attack diversity. We propose \rainbowplus{}, a framework that reconceptualizes adversarial prompt generation as evolutionary quality-diversity search, in which diverse attack strategies co-evolve across behavioral niches. \rainbowplus{} introduces two synergistic innovations: (1) \textit{multi-element archives} that maintain populations of elite solutions per niche, and (2) \textit{parallel fitness evaluation} that replaces pairwise comparisons with efficient probabilistic scoring, achieving a $\Theta(M)$ speedup (from $\Theta(M^2N)$ to $\Theta(MN)$). Experiments demonstrate superior performance: compared to Rainbow Teaming, \rainbowplus{} generates $100\times$ more unique prompts (10,418 vs. 100) with higher attack success rates (95.55\% vs. 54.36\% on Ministral-8B). Against nine state-of-the-art methods on HarmBench with 12 LLMs, \rainbowplus{} achieves 81.1\% average ASR, surpassing AutoDAN-Turbo by 3.9 points, while being $9\times$ faster (1.45 vs. 13.50 hours). Code: https://anonymous.4open.science/r/rainbowplus-E0EF/
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Theory
Languages Studied: English
Submission Number: 5951