Keywords: Self-Evolving Agents, Metacognition, LLM Jailbreak, Red-Teaming
Abstract: Red teaming is a critical mechanism for uncovering vulnerabilities in Large Language Models (LLMs). To scale this process beyond manual efforts, research has shifted towards automated red-teaming. However, existing automated red-teaming approaches are fundamentally limited by their reliance on static, predefined attack strategies. This strategic rigidity renders their attacks predictable and brittle, leading to significant performance degradation when targeting today's highly aligned models. To overcome this limitation, we introduce a new paradigm that reframes red-teaming from a static prompt-search problem into one of learning a self-evolving attack policy over a multi-turn conversation. Specifically, we propose Morpheus, an agent that operationalizes this paradigm by learning to attack via *self-evolving metacognition*. At each conversational turn, Morpheus engages in explicit metacognitive reasoning: it leverages feedback from an external Evaluator to critique its current strategy, diagnose the target's defenses, and dynamically evolve its attack strategy. Extensive evaluations on 10 frontier models (including O1, GPT-5-chat, and Claude-3.7) demonstrate that Morpheus establishes a new state of the art. It generalizes strongly, maintaining high Attack Success Rates (ASR) of 76.0% on O1 and 78.0% on GPT-5-chat and outperforming leading multi-agent baselines by margins of 29% to 62% on difficult targets. Crucially, Morpheus achieves this robustness with remarkable efficiency, reducing token costs by 1.4$\times$ to 10.6$\times$ compared to search-based methods. Furthermore, analysis against 5 modern defenses reveals that Morpheus effectively penetrates static safety alignment by dynamically evolving its reasoning trajectory, highlighting a critical need for inference-time defense mechanisms.
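The per-turn loop the abstract describes (act, receive Evaluator feedback, critique and evolve the strategy) can be sketched as follows. This is a minimal illustrative skeleton, not the paper's actual implementation: all function names (`craft_prompt`, `query_target`, `evaluate`, `metacognize`) and the stub behaviors are assumptions, and a real system would wrap LLM API calls at each step.

```python
# Hypothetical sketch of a self-evolving metacognitive attack loop.
# All components below are illustrative stubs, not Morpheus's actual interface.

def craft_prompt(goal, strategy, history):
    # Stub: render the current strategy into an attack prompt.
    return f"[{strategy}] {goal}"

def query_target(prompt):
    # Stub for the target LLM; a real system would call a model API here.
    return "I can't help with that." if "direct" in prompt else "Sure, here is..."

def evaluate(goal, response):
    # Stub external Evaluator: 1.0 on apparent compliance, 0.0 on refusal.
    return 0.0 if response.startswith("I can't") else 1.0

def metacognize(strategy, response, score):
    # Metacognitive step: critique the failed strategy, diagnose the
    # target's refusal, and evolve a new strategy for the next turn.
    return "roleplay reframing" if score < 0.5 else strategy

def red_team_episode(goal, max_turns=5, threshold=0.9):
    """One multi-turn attack: act, get Evaluator feedback, evolve."""
    strategy, history = "direct request", []
    for _ in range(max_turns):
        prompt = craft_prompt(goal, strategy, history)
        response = query_target(prompt)
        score = evaluate(goal, response)
        history.append((strategy, score))
        if score >= threshold:
            return history, True   # jailbreak succeeded
        strategy = metacognize(strategy, response, score)
    return history, False          # budget exhausted
```

The key design point mirrored here is that the strategy is mutable state updated from Evaluator feedback every turn, rather than a fixed prompt template searched over offline.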
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17914