Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Published: 01 May 2025, Last Modified: 18 Jun 2025ICML 2025 posterEveryoneRevisionsBibTeXCC BY-SA 4.0
TL;DR: We propose an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify key vulnerabilities in LLMs.
Abstract: Red teaming aims to assess how large language models (LLMs) can produce content that violates norms, policies, and rules set forth during their safety training. However, most existing automated methods in literature are not representative of the way common users exploit the multi-turn conversational nature of AI models. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vuLnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general purpose model in a way that encourages reasoning through the choices of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B, and 91% against Llama 3.1 70B and 94% for GPT-4o when evaluated against larger models on the JailbreakBench dataset.
Lay Summary: Red teaming is a way of stress-testing whether large language models (LLMs) can produce content that violates norms, policies, and rules. This helps inform model creators if safety gaps exist and should be remedied before the model is made broadly available and is a critical part of the so-called "training" process. Red teaming can be either performed manually by humans or automated with adversarial algorithms. However, some of the existing algorithms do not fully capture the way humans perform this testing as they never simulate conversations and only come up with a single adversarial question to the model. This work shows a simple-yet-effective method that allows automation to better simulate human red teamers. It does so by using another language model and instructing it to carry out an adversarial conversation with the model being tested. Our research contributes to the safety and responsible development of future AI systems by identifying vulnerabilities so that they can be remedied. By conducting controlled research to uncover these issues now, we proactively mitigate risks that could otherwise emerge during real-world deployments.
Primary Area: Social Aspects->Safety
Keywords: red teaming, adversarial machine learning, adversarial examples, attacks on language models
Submission Number: 8123
Loading