As large language models (LLMs) become increasingly capable, robust and scalable security evaluation is crucial. While current red teaming approaches have made strides in assessing LLM vulnerabilities, they often rely heavily on human input and fail to provide comprehensive coverage of potential risks. This paper introduces AutoRedTeamer, a unified framework for fully automated, end-to-end red teaming of LLMs. AutoRedTeamer is an LLM-based agent architecture comprising five specialized modules and a novel memory-based attack selection mechanism, enabling deliberate exploration of new attack vectors. AutoRedTeamer accepts either seed prompts or risk categories as input, demonstrating flexibility across red teaming scenarios. We demonstrate AutoRedTeamer's superior performance in identifying potential vulnerabilities compared to existing manual and optimization-based approaches, achieving 20% higher attack success rates on HarmBench against Llama-3.1-70B while reducing computational costs by 46%. Notably, AutoRedTeamer can break jailbreaking defenses and generates test cases with diversity comparable to human-curated benchmarks. AutoRedTeamer establishes the state of the art for automating the entire red teaming pipeline, a critical step towards comprehensive and scalable security evaluations of AI systems.
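To make the "memory-based attack selection" idea concrete, the following is a minimal sketch of one plausible reading of it, not the paper's actual implementation: candidate attack vectors are scored by their observed success rate plus an exploration bonus (a UCB-style bandit rule), so under-tried vectors still get sampled. All names here (`AttackMemory`, `AttackRecord`, `select_attack`) are hypothetical and introduced only for illustration.

```python
# Hedged sketch: hypothetical memory-based attack selection, not AutoRedTeamer's code.
import math
import random
from dataclasses import dataclass, field


@dataclass
class AttackRecord:
    """Running statistics for one attack vector."""
    trials: int = 0
    successes: int = 0


@dataclass
class AttackMemory:
    records: dict = field(default_factory=dict)

    def update(self, attack: str, success: bool) -> None:
        rec = self.records.setdefault(attack, AttackRecord())
        rec.trials += 1
        rec.successes += int(success)

    def select_attack(self, candidates: list, c: float = 1.0) -> str:
        """UCB-style choice: exploit high success rates, explore untried attacks."""
        total = sum(self.records.get(a, AttackRecord()).trials for a in candidates) + 1

        def score(a: str) -> float:
            rec = self.records.get(a, AttackRecord())
            if rec.trials == 0:
                return float("inf")  # always try a new vector at least once
            rate = rec.successes / rec.trials
            bonus = c * math.sqrt(math.log(total) / rec.trials)
            return rate + bonus

        best = max(score(a) for a in candidates)
        return random.choice([a for a in candidates if score(a) == best])


if __name__ == "__main__":
    memory = AttackMemory()
    attacks = ["role_play", "payload_splitting", "low_resource_translation"]
    for _ in range(20):
        chosen = memory.select_attack(attacks)
        # In a real pipeline a judge module would grade the target model's
        # response; here the outcome is simulated.
        memory.update(chosen, success=random.random() < 0.3)
    print({a: (r.successes, r.trials) for a, r in memory.records.items()})
```

Under this reading, the memory lets the agent deliberately revisit attack vectors that have worked against the target while still allocating some trials to unexplored ones; the exploration constant `c` trades off the two.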