AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security

20 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0
Keywords: unlearning, security, safety, jailbreaking, llm
TL;DR: AegisLLM is a test-time multi-agent defense that adaptively protects LLMs from security threats like jailbreaking and sensitive information disclosure without retraining.
Abstract: Large Language Models (LLMs) are vulnerable to a range of threats, from adversarial attacks like jailbreaking to the leakage of sensitive information that should have been unlearned. Existing defense mechanisms are often static and require extensive model retraining, making them slow to adapt to evolving threats. To address this, we propose \ours, a cooperative multi-agent framework that provides adaptive, test-time defense for LLMs. At the core of \ours is a multi-agent system that can be optimized with a remarkably small number of examples to achieve strong performance on multiple, distinct security challenges. We demonstrate the effectiveness of \ours on two critical and distinct threats: unlearning and jailbreaking. On the WMDP unlearning benchmark, \ours achieves near-perfect unlearning (approaching 25\% random chance) with only 20 training examples. For jailbreaking, our system improves defense by 51\% over the base model on the StrongReject benchmark, while maintaining a low false refusal rate of only 7.9\% on PHTest. Furthermore, we show that prompts optimized on one benchmark generalize effectively to others, underscoring the robustness of our approach. Our work highlights the significant advantages of adaptive, agentic reasoning and demonstrates the power of optimization for creating scalable and efficient LLM safety solutions. We provide our code at: https://anonymous.4open.science/r/aegisllm-11B0.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22988
Loading