Do Thinking Tokens Help with Safety?

Narutatsu Ri; Abhishek Panigrahi; Sanjeev Arora

Published: 11 Jun 2026, Last Modified: 16 Jun 2026 | Mech Interp Workshop ICML 2026 Poster | CC BY 4.0

Keywords: Interpretability for AI Safety

TL;DR: Reasoning models often decide whether to refuse or comply before reasoning begins, and current safety defenses mostly shift refusal tendencies rather than making reasoning useful for safety deliberation.

Abstract: Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and safety, by providing the model a safe space to consider whether its planned answer to the request violates its safety principles. We present evidence that this intuition is not always correct. Across frontier open-weight reasoning models spanning the GPT-OSS, Qwen, Olmo, and Phi families, we find that the eventual refusal/compliance outcome is already strongly readable before any visible thinking: a probe on the first token's hidden representation predicts refusal/compliance with $0.84$--$0.95$ AUROC and $\sim88$\% balanced accuracy. Thinking turns out to behave more like prefix completion than deliberative revision, with the final outcome rarely changing after the first $\sim20$\% of thinking. Inspecting these thinking traces reveals that, among segments that appear deliberative at the text level, only a minority affect the final outcome. More strikingly, $\sim74$\% of text-level deliberations occur when the response distribution is already locked to one refusal/compliance side, even as the trace continues to look deliberative. We also find that existing inference-time and training-based safety interventions, despite being motivated by the goal of activating deliberation, largely shift model behavior toward over-refusal while suppressing already scarce deliberation signals. Our results suggest that safety behavior in current reasoning models is much less deliberative than commonly assumed, and highlight the need for methods that induce real safety deliberation.
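
To make the probing setup in the abstract concrete, the following is a minimal sketch of how a linear probe might be trained and scored with AUROC and balanced accuracy. It is an illustration under stated assumptions, not the authors' implementation: the feature vectors are synthetic Gaussian data standing in for first-token hidden states, the probe is a plain logistic regression fit by gradient descent, and all names (`train_probe`, `make_data`, the dimension and sample counts) are hypothetical.

```python
import math
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(preds, labels):
    """Mean of the true-positive rate and the true-negative rate."""
    n_pos = sum(y == 1 for y in labels)
    n_neg = sum(y == 0 for y in labels)
    tpr = sum(p == 1 and y == 1 for p, y in zip(preds, labels)) / n_pos
    tnr = sum(p == 0 and y == 0 for p, y in zip(preds, labels)) / n_neg
    return 0.5 * (tpr + tnr)


def score(w, b, x):
    """Linear probe logit for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = 1.0 / (1.0 + math.exp(-score(w, b, x))) - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_data(rng, n, d=8, shift=1.5):
    """ASSUMPTION: synthetic stand-in for hidden states; label 1 = refuse, 0 = comply."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x[0] += shift * label  # class signal lives along one direction
        X.append(x)
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(rng, 200)
X_test, y_test = make_data(rng, 200)

w, b = train_probe(X_train, y_train)
test_scores = [score(w, b, x) for x in X_test]
test_preds = [1 if s > 0 else 0 for s in test_scores]

test_auc = auroc(test_scores, y_test)
test_bal_acc = balanced_accuracy(test_preds, y_test)
print(f"AUROC={test_auc:.3f}  balanced accuracy={test_bal_acc:.3f}")
```

In an actual replication, `X_train` and `X_test` would instead hold hidden representations extracted at the first token position, and the labels would come from the observed refusal or compliance of the final response. AUROC is threshold-free, whereas balanced accuracy depends on the decision threshold (here, a logit of zero), which is why the two numbers are reported separately.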

Submission Number: 251
