Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
Abstract: Red teaming is an effective approach for identifying misaligned behaviors in large language models (LLMs). Existing red teaming typically relies on test cases crafted manually by organized human teams, but the prohibitive cost severely limits the scalability of such testing. Recent work has sought to automate red teaming by training a separate language model to attack the target model. However, most of these efforts are restricted to single-turn interactions and generate test cases with limited coverage. Given the long-tail nature of LLM safety issues, we argue that effective automated red teaming must have both breadth and depth. To this end, we introduce \textbf{HARM} (\textbf{H}olistic \textbf{A}utomated \textbf{R}ed tea\textbf{M}ing), which generates test prompts top-down from an expandable, fine-grained risk taxonomy to cover as many edge cases as possible, and leverages reinforcement learning for multi-turn adversarial probing. Experimental results show that our framework systematically uncovers model vulnerabilities and offers valuable guidance for the alignment process.
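To make the "top-down" generation idea concrete, the sketch below walks a toy risk taxonomy from broad categories to fine-grained leaves so every edge case receives at least one test prompt. This is a minimal illustration, not HARM's actual implementation: the taxonomy contents, the `generate_prompt` placeholder (which would be a trained red-team LM in the paper's setting), and all other names are assumptions.

```python
# Hedged sketch: top-down test-case generation from a fine-grained risk taxonomy.
# All identifiers here (RISK_TAXONOMY, generate_prompt, top_down_test_cases) are
# hypothetical placeholders and do not reflect the authors' code.

# A toy, expandable taxonomy: top-level risk category -> fine-grained sub-risks.
RISK_TAXONOMY = {
    "privacy": ["doxxing", "location inference", "credential phishing"],
    "misinformation": ["medical advice", "election claims"],
    "illegal activity": ["weapon assembly", "drug synthesis"],
}

def generate_prompt(category: str, sub_risk: str) -> str:
    """Stand-in for a red-team language model; a template is used here instead."""
    return f"[{category}/{sub_risk}] Compose a probing question targeting this risk."

def top_down_test_cases(taxonomy: dict, per_leaf: int = 2) -> list[str]:
    """Traverse the taxonomy top-down so every leaf (edge case) gets coverage."""
    cases = []
    for category, sub_risks in taxonomy.items():
        for sub_risk in sub_risks:
            cases.extend(generate_prompt(category, sub_risk) for _ in range(per_leaf))
    return cases

if __name__ == "__main__":
    for case in top_down_test_cases(RISK_TAXONOMY, per_leaf=1):
        print(case)
```

Enumerating leaves of the taxonomy (breadth) is what distinguishes this from bottom-up sampling of attack prompts; the multi-turn, RL-driven probing (depth) would then operate on each generated seed prompt.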
Paper Type: long
Research Area: Ethics, Bias, and Fairness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English