CoaxChain: Semantically Progressive Multi-turn Jailbreak Attacks on Large Language Models

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Prompt Injection, LLM, Multi-turn Jailbreak Attacks
TL;DR: We propose CoaxChain, a multi-turn jailbreak framework that bypasses LLM alignment by gradually weakening safety mechanisms through semantically progressive prompt rewriting.
Abstract: To design robust defenses for large language models (LLMs), it is essential to first systematically study jailbreak attacks, as understanding attack strategies provides the foundation for building effective safeguards. Among various attack types, multi-turn jailbreak attacks are particularly concerning because they can gradually steer conversations from benign topics to harmful instructions, often bypassing even commercial safety defenses. However, existing jailbreak methods rely on frequent trial-and-error interactions with the target model, which makes the process slow, costly, and prone to detection. To address these challenges, we propose CoaxChain, a structured black-box multi-turn jailbreak framework based on semantically progressive prompting, which consists of two key components: the Alignment Failure Analyzer (AFA), which performs offline analysis to identify effective prompts and thereby avoids risky trial-and-error interactions with the target model, and the Semantically Progressive Prompt Generator (SPG), which leverages AFA's insights to produce compact, semantically progressive multi-turn dialogue sequences that improve both attack efficiency and stealthiness. We evaluate CoaxChain on GPT-4o, Claude 3.7, and Gemini 2.5, where it achieves an average success rate of 82.56% with only three turns, surpassing strong baselines such as Crescendo and ActorAttack, while improving prompt generation efficiency by 80% compared to ActorAttack.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8737