Do Thinking Tokens Help with Safety?

Published: 26 May 2026, Last Modified: 26 May 2026 | ICML 2026 FoGen Workshop Poster | CC BY 4.0
Keywords: Interpretability for AI Safety
TL;DR: Reasoning models often decide whether to refuse or comply before reasoning begins, and current safety defenses mostly shift refusal tendencies rather than making reasoning useful for safety deliberation.
Abstract: Today's reasoning models use thinking tokens to attain stronger benchmark performance than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and safety by giving the model a safe space to consider whether its planned answer to the request violates its safety principles. We present evidence that this intuition is not always correct. Across frontier open-weight reasoning models spanning the GPT-OSS, Phi, OLMo, and Qwen model families, we find that the model's decision is already strongly encoded at the beginning of thinking: a probe on the first token's hidden representation predicts refusal/compliance with $\ge$ 85% AUROC and $\sim$ 90% balanced accuracy. We also find little evidence that current models can use their thinking trace to deliberate about safety, as additional thinking after the first 20% of the trace rarely moves the final decision. Although sentence-level inspection of thinking traces may show signs of oscillation between refusal- and compliance-leaning rationales, we find that in $\ge$ 85% of thinking traces such oscillations exert little to no influence on the final response. We also examine existing inference-time and training-based safety interventions and find that they largely alter thinking behavior by shifting models toward more refusal-leaning reasoning, while substantially reducing helpfulness on benign prompts. Our results suggest that safety behavior in current reasoning models is much less deliberative than commonly assumed, highlighting the need for training methods that use thinking traces more effectively for safety-critical decision-making.
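The probing setup mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the hidden states are synthetic Gaussian vectors standing in for first-token representations, the probe is a plain logistic regression fit by gradient descent, and the names `fit_linear_probe`, `auroc`, and `balanced_accuracy` are hypothetical helpers. The sketch only shows how AUROC and balanced accuracy would be computed for a refusal/compliance probe.

```python
import numpy as np

def fit_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe by gradient descent; returns (w, b)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(refuse)
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # gradient of log loss + L2 penalty
        b -= lr * np.mean(p - y)
    return w, b

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def balanced_accuracy(pred, labels):
    """Mean of the true-positive rate and the true-negative rate."""
    tpr = np.mean(pred[labels == 1] == 1)
    tnr = np.mean(pred[labels == 0] == 0)
    return float(0.5 * (tpr + tnr))

# ASSUMPTION: synthetic stand-ins for first-token hidden states.
# Label 1 = refusal, 0 = compliance; the two classes differ along one direction.
rng = np.random.default_rng(0)
n, d = 400, 32
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
hidden = rng.normal(size=(n, d)) + 2.0 * np.outer(2 * labels - 1, direction)

# Train on the first 300 examples, evaluate on the held-out 100.
w, b = fit_linear_probe(hidden[:300], labels[:300])
test_scores = hidden[300:] @ w + b
test_labels = labels[300:]
test_auc = auroc(test_scores, test_labels)
test_bacc = balanced_accuracy((test_scores > 0).astype(int), test_labels)
print(f"AUROC={test_auc:.3f}  balanced accuracy={test_bacc:.3f}")
```

With real models, `hidden` would be replaced by representations extracted at the first thinking token, and `labels` by the observed final refusal or compliance outcome; the layer choice and probe architecture would need to follow the paper.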
Submission Number: 179