Keywords: safety, jailbreaks, chain-of-thought
TL;DR: We introduce CoT Hijacking, a new jailbreak attack on reasoning models
Abstract: Reasoning models are widely used to improve task performance by allocating more inference-time compute, and prior work suggests that this extra compute may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safety. We introduce Chain-of-Thought Hijacking (CoT Hijacking), a jailbreak attack on reasoning models. The attack pads a harmful request with long sequences of harmless reasoning followed by a final-answer cue. On HarmBench, CoT Hijacking reaches a 99% attack success rate (ASR) on Gemini 2.5 Pro, far exceeding prior jailbreak methods. Our mechanistic analysis shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally increased ASR, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24867