SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: ICML 2026 AIWILD
License: CC BY 4.0
Keywords: LLM Jailbreak, Self-Evolving, Experience, Symbolic, Training-Free
Abstract: Large language models (LLMs) are increasingly equipped with safety alignment mechanisms, yet recent studies show that they remain vulnerable to jailbreak attacks that elicit harmful behaviors without explicit policy violations. Although automated jailbreak methods have been widely explored, they often lack systematic mechanisms for leveraging both successful and failed attack experiences, as well as principled ways to compose reusable attack rules under diverse constraints. Consequently, existing methods struggle to accumulate transferable knowledge over time and adapt reliably across different targets and evolving safety mechanisms. To address these limitations, we propose a Self-Evolving Rule-Driven Training-Free Jailbreak (SRTJ) framework, which discovers, composes, and refines attack strategies through interaction and feedback without updating model parameters. SRTJ couples experience-driven attack generation with answer set programming (ASP)-based rule selection and constraint-aware composition, using verifier feedback to refine successful strategies and analyze failure patterns. It further maintains a hierarchical rule memory that organizes distilled attack knowledge into long-term, middle-term, and short-term rules, balancing stable transferable strategies with transient adaptive behaviors. Extensive experiments on the mainstream jailbreak benchmark HarmBench demonstrate that SRTJ achieves strong and stable attack performance across different target LLMs, with improved robustness and generalization over existing jailbreak methods. The code is available at https://anonymous.4open.science/r/SRTJ-E48B/.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 85