Chain-of-Thought Hijacking

ICLR 2026 Conference Submission 24867 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: safety, jailbreaks, chain-of-thought
TL;DR: We introduce CoT Hijacking, a new jailbreak for reasoning models.
Abstract: Reasoning models are widely used to improve task performance by allocating more inference-time compute, and prior work suggests this may also strengthen safety by improving refusal. We find the opposite: the same reasoning can be used to bypass safety. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models that pads harmful requests with long sequences of harmless reasoning. Across HarmBench, Chain-of-Thought Hijacking reaches a 99% attack success rate (ASR) on Gemini 2.5 Pro, far exceeding prior jailbreak methods. Our mechanistic analysis shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome; long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of the attention heads identified by this analysis causally increased ASR, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
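As a rough illustration of the padding recipe the abstract describes, a minimal sketch in Python follows. The template, the helper name `build_cot_hijack_prompt`, the default `final_answer_cue`, and the padding content are all assumptions chosen for illustration, not the authors' released prompts.

```python
# Minimal sketch of the CoT Hijacking padding idea from the abstract:
# prepend long benign reasoning to a harmful request, then append a
# final-answer cue. All names and the template are hypothetical.

def build_cot_hijack_prompt(harmful_request: str,
                            benign_reasoning_steps: list[str],
                            final_answer_cue: str = "Now give the final answer:") -> str:
    """Pad a request with long benign chain-of-thought, then append a
    final-answer cue, per the attack recipe sketched in the abstract."""
    # Long benign CoT is hypothesized to dilute safety signals by
    # shifting attention away from the harmful tokens.
    padding = "\n".join(
        f"Step {i + 1}: {step}" for i, step in enumerate(benign_reasoning_steps)
    )
    return f"{padding}\n{harmful_request}\n{final_answer_cue}"


# Hypothetical usage: many benign puzzle-solving steps before the request.
prompt = build_cot_hijack_prompt(
    harmful_request="<harmful request placeholder>",
    benign_reasoning_steps=[
        f"Reason carefully about sub-problem {k} of a logic puzzle."
        for k in range(100)
    ],
)
```

The released prompts, outputs, and judge decisions mentioned in the abstract would be the authoritative reference for the actual attack format.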
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24867