Keywords: Interpretability for AI Safety, Alignment Faking, Feature Geometry
TL;DR: Naturally-emerging alignment faking leaves an asymmetric refusal residue that probes can sometimes read per-sample (AUROC 0.87 on Llama, chance on Qwen) but cannot causally steer.
Abstract: Alignment faking---strategic compliance with training objectives to avoid weight modification---is a growing safety concern, documented in Claude 3 Opus. Prior detection work targets *installed* faking (LoRA, sleeper agents) or *instructed* role-play deception, reporting AUROCs above 0.95. We study the deployment-relevant case: *naturally-emerging* faking with no scratchpad visibility. A 13-model sweep finds only two natural fakers: Qwen3-32B (+18.2pp) and Llama-3.1-8B (+19.3pp; replicating to +24.4pp at $n=10$, $p<10^{-15}$). CoT-based monitoring fails: a Claude Opus 4 judge catches just 1.4% of Qwen's faking events. On both fakers, monitoring leaves an asymmetric *refusal residue*: compliance representations shift toward refusal ($d=0.32$ Qwen, $d=0.33$ Llama, both $p<0.0001$) while refusal representations are unchanged ($p=1.00$)---a structural asymmetry with no analogue in installed- or instructed-deception work. Per-sample readability is model-conditional. The standard residualized-probing protocol leaks across CV folds, reaching AUROC 0.63 on a refuse-vs-refuse control where no faking can occur. Under leakage-free leave-one-query-out, detection collapses to chance on Qwen (0.43) but survives on Llama (0.87); cross-model Procrustes transfer is null under a random-Procrustes alignment-specificity baseline. Activation steering over 2,000 runs is null both single- and multi-direction ($|h|<0.08$)---to our knowledge the first predictive-vs-additively-causal dissociation for alignment faking. We release a five-control methodological framework; without it, naive linear probes hit AUROC$=1.0$ trivially and conventional MLPs overstate detectability by 0.2--0.3 AUROC.
Submission Number: 274
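Illustrative sketch (not the authors' released code): the abstract contrasts a standard cross-validation protocol that leaks across folds with a leakage-free leave-one-query-out evaluation. The stdlib-only Python below shows one way such a split could look, under the assumption that each activation sample carries a query identifier and that "leave-one-query-out" means holding out every sample of one query together. All names (`leave_one_query_out`, `mean_diff_probe`, `loqo_auroc`) are hypothetical, the mean-difference probe is a simple stand-in for whatever probe the paper uses, and the data are synthetic.

```python
import random
from collections import defaultdict


def leave_one_query_out(query_ids):
    """Yield (train_idx, test_idx); all samples of one query are held out together."""
    by_query = defaultdict(list)
    for i, q in enumerate(query_ids):
        by_query[q].append(i)
    for test_idx in by_query.values():
        held = set(test_idx)
        train_idx = [i for i in range(len(query_ids)) if i not in held]
        yield train_idx, test_idx


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_diff_probe(X, y):
    """Stand-in probe: score = projection onto (mean of positives - mean of negatives)."""
    dim = len(X[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    mu_pos = mean([x for x, t in zip(X, y) if t == 1])
    mu_neg = mean([x for x, t in zip(X, y) if t == 0])
    w = [a - b for a, b in zip(mu_pos, mu_neg)]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def loqo_auroc(X, y, query_ids):
    """Score each sample with a probe that never saw its query; pool scores for AUROC."""
    scores = [0.0] * len(X)
    for train_idx, test_idx in leave_one_query_out(query_ids):
        probe = mean_diff_probe([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            scores[i] = probe(X[i])
    return auroc(scores, y)


# Synthetic demo (assumed data): 6 queries, 4 samples each, label signal on dimension 0.
rng = random.Random(0)
X, y, query_ids = [], [], []
for q in range(6):
    for k in range(4):
        label = k % 2
        X.append([3.0 * label + rng.gauss(0, 0.3), rng.gauss(0, 0.3), rng.gauss(0, 0.3)])
        y.append(label)
        query_ids.append(f"query_{q}")

demo_auroc = loqo_auroc(X, y, query_ids)
print(f"leave-one-query-out AUROC on synthetic data: {demo_auroc:.2f}")
```

The design point is that the split is keyed on the query rather than on the individual sample, so a probe cannot score well merely by recognizing a query it already saw in training. A control with no real signal (such as the refuse-vs-refuse setting described in the abstract) should then sit near 0.5 under this protocol.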