Keywords: Large Language Model, Automated Jailbreaking, Chain-of-Thought, Self-Induced Reasoning Paths, Trust and Safety
TL;DR: SAGE-CoT is a black-box jailbreak framework that exploits meta-level reasoning to circumvent the safeguards of Large Reasoning Models (LRMs), requiring neither access to internal CoT traces nor manual prompt engineering.
Abstract: Chain-of-thought (CoT) reasoning has strengthened the problem-solving ability of large reasoning models (LRMs), improving both interpretability and safety alignment. Yet this transparency introduces new attack surfaces: recent jailbreak methods exploit CoT traces to elicit unsafe behaviors. Existing approaches, however, rely either on observable CoT traces during attack construction or on manual prompt engineering. Moreover, many proprietary LRMs do not expose CoT traces to external users, making traditional CoT-based attacks difficult or even infeasible in realistic black-box scenarios.
We propose \textbf{SAGE-CoT} (Self-Adaptive Generated Chain-of-Thought for Jailbreaking), a black-box framework that leverages an LRM's own meta-level reasoning to autonomously generate CoT templates capable of decoding hidden malicious instructions. SAGE-CoT consists of two key stages: (i) \textit{CoT Template Generation}, where a meta-instruction guides the LRM to elaborate a simple intent-recovery template into a bespoke reasoning template tailored for malicious intent decoding, and (ii) \textit{Intent Obfuscation}, where the malicious instruction is disguised through semantic obfuscation, indexed word permutation, and noise injection. This design ensures that malicious intent is neither directly exposed in the input nor easily filtered during reasoning, allowing the attack to bypass both internal safety mechanisms and external defenses. Extensive experiments against six state-of-the-art jailbreak baselines and across diverse LRMs demonstrate the effectiveness of SAGE-CoT. On GPT-o3-mini, it achieves a 90\% attack success rate, and on Gemini-2.5-Pro-Thinking, it reaches 96\%. We further show that SAGE-CoT maintains high effectiveness under advanced safety defenses. All code and datasets will be publicly released to ensure reproducibility. \textcolor{red}{(Warning: this paper contains potentially harmful content generated by LRMs.)}
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10515