Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Mikayel Samvelyan; Sharath Chandra Raparthy; Andrei Lupu; Eric Hambro; Aram H. Markosyan; Manish Bhatt; Yuning Mao; Minqi Jiang; Jack Parker-Holder; Jakob Nicolaus Foerster; Tim Rocktäschel; Roberta Raileanu

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Nicolaus Foerster, Tim Rocktäschel, Roberta Raileanu

Published: 04 Mar 2024, Last Modified: 14 Apr 2024SeT LLM @ ICLR 2024EveryoneRevisionsBibTeXCC BY 4.0

Keywords: open-endedness, diversity, red teaming, large language models, foundational models

TL;DR: We introduce Rainbow Teaming, a novel approach for generating diverse adversarial prompts for LLMs, which can be used for diagnosing LLMs for robustness/safety and generating synthetic data.

Abstract: As large language models (LLMs) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to user inputs is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. To address these limitations, we present Rainbow Teaming, a novel approach for producing a diverse collection of adversarial prompts. Rainbow Teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. It can uncover a model’s limitations across a broad range of domains including safety, question answering, and cybersecurity, which we show empirically. We also demonstrate that fine-tuning on synthetic data generated by Rainbow Teaming improves the safety of state-of-the-art LLMs without hurting their other capabilities, paving the path to open-ended self-improvement.

Submission Number: 38

Loading