Keywords: adversarial robustness, large language models, jailbreaking, ai agents
TL;DR: We propose a LLM agent framework to automate red teaming
Abstract: As large language models (LLMs) become increasingly capable, robust and scalable security evaluation is crucial. While current red teaming approaches have made strides in assessing LLM vulnerabilities, they often rely heavily on human input and fail to provide comprehensive coverage of potential risks. This paper introduces AutoRedTeamer, a unified framework for fully automated, end-to-end red teaming against LLMs. AutoRedTeamer is an LLM-based agent architecture comprising five specialized modules and a novel memory-based attack selection mechanism, enabling deliberate exploration of new attack vectors. AutoRedTeamer supports both seed prompt and risk category inputs, demonstrating flexibility across red teaming scenarios. We demonstrate AutoRedTeamer’s superior performance in identifying potential vulnerabilities compared to existing manual and optimization-based approaches, achieving higher attack success rates by 20% on HarmBench against Llama-3.1-70B while reducing computational costs by 46%. Notably, AutoRedTeamer can break jailbreaking defenses and generate test cases with comparable diversity to human-curated benchmarks. AutoRedTeamer establishes the state of the art for automating the entire red teaming pipeline, a critical step towards comprehensive and scalable security evaluations of AI systems.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9354
Loading