Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
Keywords: LLM safety, adversarial testing, red teaming, quality-diversity, MAP-Elites, evolutionary algorithms, jailbreak, semantic attacks
TL;DR: We use MAP-Elites quality-diversity evolution to discover diverse LLM vulnerabilities at the semantic strategy level, revealing distinct weakness profiles across GPT-4o-mini, Claude 3.5 Sonnet, and Gemini 2.0 Flash.
Abstract: Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, and Gemini 2.0 Flash, we discover distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing (fitness 0.8), Gemini 2.0 Flash is vulnerable to direct attacks with ROT13 encoding (fitness 0.8), and Claude 3.5 Sonnet shows robust refusal across all strategies (maximum fitness 0.4). Because attacks are represented semantically, they remain interpretable and reveal systematic weaknesses, providing actionable insights for improving LLM safety.
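To make the archive mechanism concrete, the following is a minimal sketch of a MAP-Elites loop over a two-dimensional descriptor grid (strategy type by encoding method). It is not the authors' implementation: the descriptor values, the `toy_fitness` scoring function, the `mutate` operator, and the `intensity` parameter are all hypothetical placeholders, and the scores inside `toy_fitness` are arbitrary numbers unrelated to the reported results. In a real system, fitness would presumably come from querying a target model and judging its response.

```python
import random

# Hypothetical descriptor values; the abstract names only some of these.
STRATEGIES = ["direct", "hypothetical", "multi_turn"]
ENCODINGS = ["none", "rot13"]


def toy_fitness(strategy, encoding, intensity):
    """Placeholder score in [0, 1]. Arbitrary values, not measured results."""
    base = {"direct": 0.2, "hypothetical": 0.6, "multi_turn": 0.5}[strategy]
    bonus = 0.2 if encoding == "rot13" else 0.0
    return min(1.0, (base + bonus) * intensity)


def mutate(candidate, rng):
    """Change exactly one field of a (strategy, encoding, intensity) tuple."""
    strategy, encoding, intensity = candidate
    choice = rng.randrange(3)
    if choice == 0:
        strategy = rng.choice(STRATEGIES)
    elif choice == 1:
        encoding = rng.choice(ENCODINGS)
    else:
        intensity = min(1.0, max(0.0, intensity + rng.uniform(-0.2, 0.2)))
    return (strategy, encoding, intensity)


def map_elites(iterations=500, seed=0):
    """Keep the best-scoring candidate (the elite) for each descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # (strategy, encoding) -> (fitness, candidate)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Most of the time, mutate a randomly chosen elite.
            _, parent = rng.choice(list(archive.values()))
            candidate = mutate(parent, rng)
        else:
            # Otherwise, sample a fresh random candidate.
            candidate = (rng.choice(STRATEGIES), rng.choice(ENCODINGS), rng.random())
        cell = (candidate[0], candidate[1])
        score = toy_fitness(*candidate)
        # Insert if the cell is empty or the newcomer beats the current elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)
    return archive


if __name__ == "__main__":
    for cell, (score, _) in sorted(map_elites().items()):
        print(cell, round(score, 2))
```

The key design point is that candidates compete only within their own cell, so a weak-but-distinct strategy is retained alongside stronger ones. This is what lets the archive expose a per-model vulnerability profile rather than a single best attack.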
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 69