Do Thinking Tokens Help with Safety?
Keywords: safety, alignment, large language models
TL;DR: Reasoning models often decide whether to refuse or comply before reasoning begins, and current safety defenses mostly shift refusal tendencies rather than making reasoning useful for safety deliberation.
Abstract: Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts.
It is also generally believed that this more "deliberative" mode should improve alignment and safety by giving the model a safe space to consider whether its planned answer to the request violates its safety principles. We present evidence that this intuition is not always correct. Across frontier open-weight reasoning models including GPT-OSS, Phi, OLMo, and Qwen, we find that the model's decision is already strongly encoded at the beginning of thinking: a probe on the first token's hidden representation predicts refusal/compliance with $\ge 0.85$ AUROC and $\sim 90$\% balanced accuracy. We also find little evidence of genuine safety deliberation in thinking models, as additional thinking after the first $20$\% of the trace rarely changes the final decision. While sentence-level inspection of thinking traces shows signs of oscillation between refusal- and compliance-leaning rationales, we find that such oscillations exert little to no influence on the final response in $\ge 85$\% of thinking traces. Furthermore, existing inference-time and training-based safety interventions largely shift models toward refusal-leaning reasoning, substantially reducing helpfulness on benign prompts. Together, our results suggest that safety behavior in current reasoning models is far less deliberative than assumed, highlighting the need for training methods that use thinking traces effectively for safety-critical decisions.
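To make the probing setup concrete, the following is a minimal sketch of a linear probe evaluated with AUROC and balanced accuracy. It is not the authors' implementation: the hidden states are synthetic Gaussian vectors standing in for first-thinking-token representations, and every name here (`make_synthetic_states`, `train_probe`, `score`, `auroc`, `balanced_accuracy`), along with the dimensions, class shift, learning rate, and 0.5 threshold, is a hypothetical choice made for illustration.

```python
import math
import random


def make_synthetic_states(n, dim, seed=0):
    """Hypothetical stand-in for first-thinking-token hidden states.

    Label 1 = refusal, 0 = compliance. The two classes differ by a
    fixed mean shift of +/-0.5 per coordinate (an assumption, not data).
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # alternate labels so the classes are balanced
        shift = 0.5 if y == 1 else -0.5
        xs.append([rng.gauss(0.0, 1.0) + shift for _ in range(dim)])
        ys.append(y)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Probe output: estimated probability that the label is 1 (refusal)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(xs, ys, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = score(w, b, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(scores, labels, threshold=0.5):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Train on the first 200 synthetic examples, evaluate on the last 100.
xs, ys = make_synthetic_states(300, 8)
w, b = train_probe(xs[:200], ys[:200])
test_scores = [score(w, b, x) for x in xs[200:]]
print("AUROC:", auroc(test_scores, ys[200:]))
print("Balanced accuracy:", balanced_accuracy(test_scores, ys[200:]))
```

In an actual replication, the synthetic vectors would be replaced by hidden states extracted from a model at the first thinking token, with labels derived from whether the final response refuses or complies. Balanced accuracy is reported alongside AUROC because refusal and compliance labels are typically imbalanced, which can make plain accuracy misleading.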
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 146