What Do Refusal Tokens Learn? Fine-Grained Representations and Evidence for Downstream Steering

Published: 30 Sept 2025 · Last Modified: 09 Nov 2025
Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Open Source Links: https://github.com/RishabSA/interp-refusal-tokens
Keywords: AI Safety, Steering, Understanding high-level properties of models
Other Keywords: refusal
TL;DR: We show that categorical refusal tokens enable fine-grained, interpretable control of language model safety by reducing over-refusal without harming overall performance.
Abstract: Language models are fine-tuned for safety alignment to refuse harmful prompts. One such method fine-tunes a language model to generate categorical refusal tokens that distinguish different types of refusals. In this work, we investigate whether categorical refusal tokens enable controllable, interpretable refusal behavior in language models. Specifically, using a version of Llama-3 8B Base fine-tuned with categorical refusal tokens, we extract residual-stream activations and compute category-specific steering vectors. We then apply these category-specific steering vectors at inference time to control refusal behavior, reducing over-refusals on benign and ambiguous prompts to nearly zero while maintaining refusal rates on truly harmful prompts and minimizing degradation of general model performance. We perform model diffing of steering vectors between Llama-3 8B Base and the refusal-token fine-tuned model, revealing low cross-model cosine similarity in four of the five categories, suggesting that the emergence of our identified refusal features is mediated specifically by refusal-token fine-tuning. Our results indicate that refusal tokens are promising for shaping fine-grained safety directions that facilitate targeted control, interpretability, and reduced over-refusals.
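To make the pipeline concrete, here is a minimal sketch of one common way to build and apply a steering vector from residual-stream activations. The abstract does not specify the construction, so the difference-of-means recipe, the function names (`steering_vector`, `steer`), and the scaling parameter `alpha` are assumptions for illustration, not the paper's exact method; the toy arrays stand in for real model activations.

```python
import numpy as np

def steering_vector(category_acts, baseline_acts):
    """Hypothetical difference-of-means steering vector.

    category_acts: (n, d) residual-stream activations on prompts from
        one refusal category.
    baseline_acts: (m, d) residual-stream activations on neutral prompts.
    """
    return category_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def steer(hidden_state, vec, alpha=-1.0):
    """Add the scaled steering vector to a residual-stream state at
    inference time. A negative alpha suppresses the refusal direction
    (reducing over-refusal); a positive alpha amplifies it."""
    return hidden_state + alpha * vec

# Toy demonstration: plant a known "refusal direction" in synthetic
# activations and check that the recipe recovers and removes it.
rng = np.random.default_rng(0)
d = 8
refusal_dir = rng.normal(size=d)
baseline = rng.normal(size=(32, d))
category = baseline + refusal_dir  # category acts shifted along the direction

vec = steering_vector(category, baseline)       # recovers refusal_dir
steered = steer(category[0], vec, alpha=-1.0)   # recovers baseline[0]
```

In practice the vector would be computed per refusal category from activations at a chosen layer of the fine-tuned model, and `steer` would be applied inside the forward pass (e.g. via a hook at that layer) rather than to a standalone array.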
Submission Number: 231