Automatic Dialectic Jailbreak: A Framework for Generating Effective Jailbreak Strategies

ICLR 2026 Conference Submission20071 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Jailbreak Attacks, Large Language Models, Multi-Objective Game, White-box Jailbreak, Black-box Jailbreak

TL;DR: A fully automated, end-to-end jailbreak pipeline that eliminates the need for human-crafted inputs and generalizes inherently to black-box models.

Abstract: Large language models (LLMs) can be jailbroken into producing malicious or unethical content by embedding jailbreaking prompts in their input. Unfortunately, current jailbreak attack techniques adapt poorly because they rely on fixed evaluation models, and they fail to survive a wide range of defense mechanisms. In this work, we propose Automatic Dialectic Jailbreak (ADJ), which models the jailbreak attack problem as a Stackelberg multi-objective game between two LLMs engaged in a Hegelian-dialectic-style debate, enabling the automatic generation of jailbreak strategies. In ADJ, iterative thesis-antithesis-synthesis cycles of Hegelian dialectical reasoning are executed to ensure that both attacker and defender maximize their own utility while minimizing that of their opponent. We map the optimization problem from the original parameter space into a Hilbert space via the Haar wavelet transform in order to efficiently extract localized and structurally significant information. In this function space, we solve a convex multi-objective optimization problem to construct a common descent direction that better aligns with the objectives in ADJ. To ensure sufficient descent for each objective in ADJ, we construct a subset of descent components and integrate them directly into the optimization objective. We theoretically establish the existence of a Pareto–Nash equilibrium for our Automatic Dialectic Jailbreak method and show that our algorithm converges to this equilibrium.
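
The two generic numerical ingredients named in the abstract, a Haar wavelet transform and a common descent direction for several objectives, can be illustrated on toy vectors. The sketch below is not the authors' algorithm: the abstract does not specify its formulation, so the code assumes a single-level orthonormal Haar step and substitutes the standard closed-form min-norm combination of two gradients. The function names `haar_step` and `min_norm_direction` are hypothetical.

```python
import numpy as np


def haar_step(x):
    """One level of the orthonormal Haar transform of an even-length vector.

    Returns the pairwise averages (approximation) followed by the pairwise
    differences (detail), each scaled by 1/sqrt(2) so the norm is preserved.
    """
    x = np.asarray(x, dtype=float)
    if x.size % 2 != 0:
        raise ValueError("input length must be even")
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return np.concatenate([approx, detail])


def min_norm_direction(g1, g2):
    """Common descent direction for two objectives with gradients g1, g2.

    Assumption: uses the min-norm point of the segment between g1 and g2,
    i.e. argmin over alpha in [0, 1] of ||alpha*g1 + (1-alpha)*g2||^2,
    which has a closed form. The negated min-norm point d satisfies
    d @ g1 <= 0 and d @ g2 <= 0, so it does not increase either objective
    to first order.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        alpha = 0.5  # identical gradients: any convex weight gives the same point
    else:
        alpha = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return -(alpha * g1 + (1.0 - alpha) * g2)


if __name__ == "__main__":
    # Toy gradients of two objectives in the original coordinates.
    grad_a = np.array([4.0, 2.0, 5.0, 5.0])
    grad_b = np.array([-1.0, 3.0, 0.0, 2.0])

    # Move both gradients into Haar coordinates, then combine them.
    ha, hb = haar_step(grad_a), haar_step(grad_b)
    d = min_norm_direction(ha, hb)
    print("direction:", d)
    print("d . grad_a:", d @ ha, " d . grad_b:", d @ hb)
```

Because the Haar step here is orthonormal, inner products are preserved, so a direction that is a descent direction for both objectives in Haar coordinates remains one after mapping back to the original coordinates.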

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 20071
