Morpheus: Learning to Jailbreak via Self-Evolving Metacognition

ICLR 2026 Conference Submission 17914 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Self-Evolving Agents, Metacognition, LLM Jailbreak, Red-Teaming
Abstract: Red teaming is a critical mechanism for uncovering vulnerabilities in Large Language Models (LLMs). To scale this process beyond manual efforts, research has shifted towards automated red-teaming. However, existing automated red-teaming approaches are fundamentally limited by their reliance on static, predefined attack strategies. This strategic rigidity renders their attacks predictable and brittle, leading to significant performance degradation when targeting today's highly aligned models. To overcome this limitation, we introduce a new paradigm that reframes red-teaming from a static prompt-search problem into one of learning a self-evolving attack policy over a multi-turn conversation. Specifically, we propose Morpheus, an agent that operationalizes this paradigm by learning to attack via self-evolving metacognition. At each conversational turn, Morpheus engages in explicit metacognitive reasoning: it leverages feedback from an external Evaluator to critique its current strategy, diagnose the target's defenses, and dynamically evolve its attack plan. Our learning-based approach demonstrates state-of-the-art efficacy, outperforming leading methods by substantial margins of 42% to 62% on frontier models such as Claude-3.7 and O1. Furthermore, a scaling analysis highlights Morpheus's learning capacity: it achieves near-perfect Attack Success Rates (ASR) of 100% on GPT-4o and 98% on Llama3-8B given an increased interaction budget, all while maintaining remarkable efficiency.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17914