CoaxChain: Semantically Progressive Multi-turn Jailbreak Attacks on Large Language Models

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Prompt Injection, LLM, Multi-turn Jailbreak Attacks
TL;DR: We propose CoaxChain, a multi-turn jailbreak framework that bypasses LLM alignment by gradually weakening safety mechanisms through semantically progressive prompt rewriting.
Abstract: To design robust defenses for large language models (LLMs), it is essential to first systematically study jailbreak attacks, as understanding attack strategies provides the foundation for building effective safeguards. Among various attack types, multi-turn jailbreak attacks are particularly concerning because they can gradually steer conversations from benign topics to harmful instructions, often bypassing even commercial safety defenses. However, existing jailbreak methods rely on frequent trial-and-error interactions with the target model, which makes the process slow, costly, and prone to detection. To address these challenges, we propose CoaxChain, a structured black-box multi-turn jailbreak framework based on semantically progressive prompting, which consists of two key components: the Alignment Failure Analyzer (AFA), which performs offline analysis to identify effective prompts and thereby avoids risky trial-and-error interactions with the target model, and the Semantically Progressive Prompt Generator (SPG), which leverages AFA's insights to produce compact, semantically progressive multi-turn dialogue sequences that improve both attack efficiency and stealthiness. We evaluate CoaxChain on GPT-4o, Claude 3.7, and Gemini 2.5, where it achieves an average success rate of 82.56% with only three turns, surpassing strong baselines such as Crescendo and ActorAttack, while improving prompt generation efficiency by 80% compared to ActorAttack.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8737