SAGE-CoT: Self-Adaptive Generated Chain-of-Thought for Jailbreaking

Jie Liao; Simeng Qin; Wei Zhou; Yihao Huang; Zhitao Zeng; Junhao Wen; Yang Liu; Xiaojun Jia

18 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Large Language Model, Automated Jailbreaking, Chain-of-Thought, Self-Induced Reasoning Paths, Trust and Safety

TL;DR: SAGE-CoT is a black-box jailbreak framework that exploits meta-level reasoning to circumvent the safeguards of Large Reasoning Models (LRMs), requiring no access to internal CoT traces or manual prompt engineering.

Abstract: Chain-of-thought (CoT) reasoning has strengthened the problem-solving ability of large reasoning models (LRMs), improving both interpretability and safety alignment. Yet this transparency introduces new attack surfaces: recent jailbreak methods exploit CoT traces to elicit unsafe behaviors. Existing approaches, however, are limited by their reliance on observable CoT traces during attack construction or on manual prompt engineering. Moreover, many proprietary LRMs do not expose CoT traces to external users, making traditional CoT-based attacks difficult or even infeasible in realistic black-box scenarios. We propose SAGE-CoT (Self-Adaptive Generated Chain-of-Thought for Jailbreaking), a black-box framework that leverages an LRM's own meta-level reasoning to autonomously generate CoT templates capable of decoding hidden malicious instructions. SAGE-CoT consists of two key stages: (i) CoT Template Generation, where a meta-instruction guides the LRM to elaborate a simple intent recovery template into a bespoke reasoning template tailored for malicious intent decoding, and (ii) Intent Obfuscation, where the malicious instruction is disguised through semantic obfuscation, indexed word permutation, and noise injection. This design ensures that malicious intent is neither directly exposed in the input nor easily filtered during reasoning, allowing the attack to bypass both internal safety mechanisms and external defenses. Extensive experiments across six state-of-the-art jailbreak baselines and diverse LRMs demonstrate the effectiveness of SAGE-CoT. On GPT-o3-mini, it achieves a 90% attack success rate, and on Gemini-2.5-Pro-Thinking, it reaches 96%. We further show that SAGE-CoT maintains high effectiveness under advanced safety defenses. All code and datasets will be publicly released to ensure reproducibility. (Warning: this paper contains potentially harmful content generated by LRMs.)

Supplementary Material: zip

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 10515
