What Do Refusal Tokens Learn? Fine-Grained Representations and Evidence for Downstream Steering

Published: 30 Sept 2025 · Last Modified: 09 Nov 2025
Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Open Source Links: https://github.com/RishabSA/interp-refusal-tokens
Keywords: AI Safety, Steering, Understanding high-level properties of models
Other Keywords: refusal
TL;DR: We show that categorical refusal tokens enable fine-grained, interpretable control of language model safety by reducing over-refusal without harming overall performance.
Abstract: Language models are fine-tuned for safety alignment to refuse harmful prompts. One such method fine-tunes a language model to generate categorical refusal tokens that distinguish different types of refusals. In this work, we investigate whether categorical refusal tokens enable controllable, interpretable refusal behavior in language models. Specifically, using a version of Llama-3 8B Base fine-tuned with categorical refusal tokens, we extract residual-stream activations and compute category-specific steering vectors. We then apply these category-specific steering vectors at inference time to control refusal behavior, reducing over-refusals on benign and ambiguous prompts to nearly zero while maintaining refusal rates on truly harmful prompts and minimizing degradation of general model performance. We perform model diffing of steering vectors between Llama-3 8B Base and the refusal-token fine-tuned model, revealing low cross-model cosine similarity in four of the five categories, suggesting that the emergence of our identified refusal features is mediated specifically by refusal-token fine-tuning. Our results indicate that refusal tokens are promising for shaping fine-grained safety directions that facilitate targeted control, interpretability, and reduced over-refusals.
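To make the pipeline concrete, here is a minimal sketch of one common way to build and apply a steering vector from residual-stream activations. The abstract does not specify the construction, so the difference-of-means recipe, the function names (`steering_vector`, `steer`), and the scaling parameter `alpha` are assumptions for illustration, not the paper's exact method; the toy arrays stand in for real model activations.

```python
import numpy as np

def steering_vector(category_acts, baseline_acts):
    """Hypothetical difference-of-means steering vector.

    category_acts: (n, d) residual-stream activations on prompts from
        one refusal category.
    baseline_acts: (m, d) residual-stream activations on neutral prompts.
    """
    return category_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def steer(hidden_state, vec, alpha=-1.0):
    """Add the scaled steering vector to a residual-stream state at
    inference time. A negative alpha suppresses the refusal direction
    (reducing over-refusal); a positive alpha amplifies it."""
    return hidden_state + alpha * vec

# Toy demonstration: plant a known "refusal direction" in synthetic
# activations and check that the recipe recovers and removes it.
rng = np.random.default_rng(0)
d = 8
refusal_dir = rng.normal(size=d)
baseline = rng.normal(size=(32, d))
category = baseline + refusal_dir  # category acts shifted along the direction

vec = steering_vector(category, baseline)       # recovers refusal_dir
steered = steer(category[0], vec, alpha=-1.0)   # recovers baseline[0]
```

In practice the vector would be computed per refusal category from activations at a chosen layer of the fine-tuned model, and `steer` would be applied inside the forward pass (e.g. via a hook at that layer) rather than to a standalone array.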
Submission Number: 231