Keywords: Interpretability for AI Safety, Methods (probing, steering, causal interventions)
Other Keywords: deliberative alignment, prefilling, CoT
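TL;DR: Refusal in a deliberative-alignment-trained reasoning model (GPT-OSS-120B) is not a single linear direction: it separates into harmful, mismatch, and refusal directions with sharply different causal effects.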
Abstract: Refusal in non-reasoning LLMs has been characterized as a single linear direction, but reasoning models add an explicit deliberation channel that may complicate this mechanism. We investigate refusal in GPT-OSS-120B using activation steering and counterfactual chain-of-thought (CoT) prefilling, and identify three separable directions in the residual stream: a \emph{harmful} vector capturing pre-deliberative harmfulness, a \emph{mismatch} vector encoding prompt-CoT coherence, and a standard \emph{refusal} vector. Their causal profiles differ sharply. The harmful vector produces smooth behavioral shifts but degrades capability at high steering strength. The refusal vector, despite having the highest linear separability, is causally brittle and collapses the model into endless deliberation. The mismatch vector instead modulates whether the model trusts its own reasoning: negative steering induces snap compliance, while positive steering drives self-doubt and recursive loops. Combining mismatch steering with harmless CoT prefilling drives compliance on harmful prompts that resist either intervention alone, with less collateral damage than refusal steering, and yields functionally harmful behavior on AgentHarm. We interpret refusal in deliberatively aligned models not as a single linear feature but as the interaction of an intuitive harmfulness signal, explicit CoT reasoning, and a coupling mechanism (implemented by the mismatch vector) that gates whether reasoning overrides intuition.
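Illustrative sketch (not from the paper): the abstract does not say how steering is implemented. A common formulation, assumed here, adds a scaled unit direction to a residual-stream activation, so that a negative coefficient pushes the activation away from the direction and a positive one pushes it toward it. The toy vectors and the names `unit`, `project`, `steer`, and `mismatch_direction` are hypothetical and stand in for real model activations.

```python
import math


def unit(v):
    """Return v scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def project(h, direction):
    """Scalar projection of activation h onto the unit direction."""
    u = unit(direction)
    return sum(a * b for a, b in zip(h, u))


def steer(h, direction, alpha):
    """Additive steering (assumed form): h' = h + alpha * unit(direction)."""
    u = unit(direction)
    return [a + alpha * b for a, b in zip(h, u)]


# Toy 3-dimensional "residual stream" activation and a hypothetical direction.
h = [0.5, -1.0, 2.0]
mismatch_direction = [3.0, 0.0, 4.0]

steered_neg = steer(h, mismatch_direction, -2.0)  # negative steering
steered_pos = steer(h, mismatch_direction, 2.0)   # positive steering

print(project(h, mismatch_direction))
print(project(steered_neg, mismatch_direction))
print(project(steered_pos, mismatch_direction))
```

Under this assumed form, the projection onto the direction shifts by exactly the steering coefficient while components orthogonal to the direction are unchanged. In a real model the same update would be applied to hidden states at one or more layers during the forward pass.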
Submission Number: 531