Imitation Guided Automated Red Teaming

Published: 15 Oct 2024, Last Modified: 29 Dec 2024 · AdvML-Frontiers 2024 · CC BY 4.0
Keywords: Automated Red-teaming, Large Language Models (LLMs), Reinforcement Learning, Imitation Learning
TL;DR: A new, computationally efficient imitation-guided reinforcement learning approach (iART) for red-teaming LLMs
Abstract: The potential of large language models (LLMs) is substantial, yet they also carry the risk of generating harmful responses. An automated "red teaming" process constructs test cases designed to elicit unfavorable responses from these models. A successful test-case generator must provoke undesirable responses from the target LLM with test cases that are diverse. Current methods often struggle to balance quality (i.e., the harmfulness of the elicited responses) and diversity (i.e., the range of scenarios covered), typically sacrificing one to enhance the other, and they rely on suboptimal exhaustive comparison. To address these challenges, we introduce an imitation-guided reinforcement learning approach that learns red-teaming strategies which generate both diverse and high-quality test cases without exhaustive search. Our proposed method, Imitation-guided Automated Red Teaming (iART), is evaluated across various LLMs fine-tuned for different tasks. We demonstrate that iART not only produces diverse test sets but also elicits undesirable responses from the target LLM in a computationally efficient manner.
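For intuition only (this formulation is not taken from the paper, and all symbols below are illustrative assumptions), an imitation-guided red-teaming objective of the kind described in the abstract could combine a response-harmfulness reward, a diversity bonus over generated test cases, and a term keeping the generator policy $\pi_\theta$ close to a distribution $\pi_{\text{demo}}$ over demonstration test cases:

$$J(\theta) \;=\; \mathbb{E}_{x \sim \pi_\theta}\!\big[\, R_{\text{harm}}(x) \;+\; \lambda\, R_{\text{div}}(x) \,\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta \,\|\, \pi_{\text{demo}}\big)$$

Here $\lambda$ would trade off quality against diversity and $\beta$ would control how strongly generation is guided toward the imitation data; the actual iART objective may differ from this sketch.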
Submission Number: 11