Static Unit-Scale Bias Steering Transfers Poorly to a Reasoning-Distilled LLM

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Mech Interp Workshop ICML 2026 Virtual poster, CC BY 4.0
Keywords: Interpretability for AI Safety, Concept Discovery (e.g., SAEs, dictionary learning), Methods (probing, steering, causal interventions)
TL;DR: A probe direction that steers bias behavior in Llama remains decodable but does not transfer as a reliable audit/control signal in R1-Distill, so cognitive-bias safeguards need per-model revalidation.
Abstract: Activation steering is a leading technique for controlling LLM behavior, but its reliability in reasoning-distilled checkpoints is unclear. We study a common intervention class, the static continuous addition of a pre-generation linear-probe direction, and find that a cognitive-bias direction that controls an instruction-tuned model does not transfer cleanly to a reasoning-distilled sibling under the same protocol. We construct a 470-item contrastive benchmark spanning 11 bias categories (base-rate neglect, conjunction fallacy, framing, and others) and compare matched-architecture pairs (Llama-3.1-8B-Instruct vs. R1-Distill-Llama-8B; OLMo-3 7B/32B Instruct vs. Think) as well as Qwen-3-8B's thinking toggle. Behavioral lure rates depend on the scorer and on benchmark scope: under final-answer rescoring of preserved 470-item raw-response artifacts, R1-Distill has a lower overall lure rate than Llama (25.5% vs. 33.2%) but remains highly vulnerable on base-rate and conjunction items; on the full 470-item OLMo-32B scale run, the lure rate falls from 19.6% (Instruct) to 0.4% (Think). Lure suppression is not treated as correctness; accuracy and other-response rates are analyzed jointly. The main result, which characterizes the scoped dissociation, is that probe-direction steering on the three vulnerable categories produces a monotonic 37.5pp dose-response in Llama (lure rates span 31% at $\alpha = +5$ to 69% at $\alpha = -5$, with zero incoherent outputs), whereas the original static continuous intervention in R1-Distill, applied across four candidate layers including its probe peak, yields only small non-monotonic lure-rate fluctuations (5.0pp full-sweep range at L31) rather than a stable dose-response; uncalibrated final-answer-span P0/T0 diagnostics likewise show no endpoint effect but have near-zero prompt-prefill KL.
Appendix diagnostics additionally show that, in the available OLMo-family 32B comparison, the larger Instruct checkpoint has a higher scored lure rate than the 7B one (14.9% → 19.6%), while the Think checkpoint remains near zero (0.4%). Diagnostic analyses show that Qwen-3-8B's hard think/no-think template induces non-transferring P0 geometries and that within-CoT linear separability is non-stationary, but these diagnostics are not treated as evidence for a template-invariant semantic axis or for causal CoT stages. These results support a narrow claim: a static unit-scale probe direction that steers Llama can fail to provide comparable behavioral control in R1-Distill-Llama-8B even when the same bias distinction remains linearly decodable. They do not rule out calibrated dynamic, broader multi-layer, SAE-feature, or nonlinear interventions.
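For readers unfamiliar with the intervention class, the following is a minimal sketch of a static unit-scale addition and of a lure-rate tally, not the authors' implementation. It assumes a toy NumPy array standing in for one layer's residual-stream activations; the function names `unit`, `steer`, and `lure_rate` and the response labels `"lure"`, `"correct"`, and `"other"` are hypothetical choices made for illustration.

```python
import numpy as np


def unit(v):
    """Return v rescaled to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("probe direction must be non-zero")
    return v / norm


def steer(hidden, direction, alpha):
    """Add alpha * unit(direction) to every token position.

    hidden: array of shape (seq_len, d_model), a stand-in for the
    activations of one layer (assumption: the same fixed vector is
    added at every position, i.e. a static intervention).
    """
    hidden = np.asarray(hidden, dtype=float)
    return hidden + alpha * unit(direction)


def lure_rate(labels):
    """Fraction of scored responses labeled 'lure' (hypothetical label)."""
    labels = list(labels)
    return sum(label == "lure" for label in labels) / len(labels)


# Toy data: 4 token positions, hidden size 8.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
direction = rng.normal(size=8)

steered = steer(hidden, direction, alpha=5.0)

# Projection of the change onto the unit direction: alpha at each position.
shift = (steered - hidden) @ unit(direction)

# Tally over hypothetical scored responses.
example_rate = lure_rate(["lure", "correct", "other", "lure"])
```

In a real model the addition would be applied inside the forward pass at a chosen layer, and a dose-response curve would come from repeating the lure-rate tally across a sweep of `alpha` values. Normalizing the direction keeps `alpha` comparable across layers only in a loose sense, since activation norms differ between layers and models.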
Submission Number: 539