Keywords: Circuit Analysis, Attribution Graphs, Methods (probing, steering, causal interventions), Interpretability for AI Safety
Other Keywords: Sparse Autoencoders, Steering
TL;DR: In 12 open-weight LLMs, the same deference-gating attention heads detect false statements. Silencing them flips sycophancy from 28% to 81% on Gemma-2-2B while leaving factual accuracy essentially unchanged, and alignment training masks but doesn't remove the circuit.
Abstract: When a language model sycophantically agrees with a user's false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the second. Across twelve open-weight models from five labs ($1.5$B-$72$B), the same small set of attention heads carries a "this statement is wrong" signal whether the model is evaluating an isolated claim or being pressured to agree with a user. Silencing these heads in Gemma-2-2B flips sycophancy from $28\%$ to $81\%$ while factual accuracy moves only from $69\%$ to $70\%$; the circuit controls deference, not knowledge. Edge-level path patching confirms that the same connections between heads are shared across sycophancy, factual lying, and instructed lying ($r{>}0.97$ on Gemma-2-2B, $r{=}0.988$-$0.995$ on Phi-4). Opinion-agreement, where there is no factual ground truth, reuses these head positions but writes into an orthogonal direction, so the substrate is not a relabeled "truth direction." Alignment training masks but does not remove this circuit: Meta's Llama-3.1$\to$3.3 RLHF refresh cut sycophancy tenfold while the shared heads persisted and the projection-ablation effect grew (substrate persistence replicates on Mistral$\to$Zephyr at 7B, an independent model family), and our own anti-sycophancy DPO reduced sycophancy by $46$-$93\%$ on two models without changing probe transfer. When these models behave sycophantically, they register the error and agree anyway.
Submission Number: 90
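The abstract refers to projection ablation and to an opinion-agreement signal written into an orthogonal direction. As a minimal sketch of what those two ideas mean geometrically, the snippet below removes the component of a toy activation vector along one direction and checks that a second, orthogonal direction is unaffected. Everything here is an illustrative assumption: the names `error_dir`, `opinion_dir`, and `ablate_direction` are hypothetical, the vectors are random stand-ins and not model activations, and this is not presented as the authors' procedure.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def ablate_direction(vec, direction):
    """Project out `direction`: vec - (vec . d / d . d) * d."""
    coeff = dot(vec, direction) / dot(direction, direction)
    return [x - coeff * d for x, d in zip(vec, direction)]


random.seed(0)
d_model = 16  # toy width, chosen only for illustration

# Hypothetical stand-ins: one activation vector and a "statement is wrong" direction.
activation = [random.gauss(0, 1) for _ in range(d_model)]
error_dir = [random.gauss(0, 1) for _ in range(d_model)]

# Hypothetical opinion direction, built to be orthogonal to error_dir
# by projecting error_dir out of a random vector.
opinion_dir = ablate_direction(
    [random.gauss(0, 1) for _ in range(d_model)], error_dir
)

ablated = ablate_direction(activation, error_dir)

print("projection on error_dir after ablation:", dot(ablated, error_dir))
print("cosine(opinion_dir, error_dir):", cosine(opinion_dir, error_dir))
print("opinion readout before:", dot(activation, opinion_dir))
print("opinion readout after: ", dot(ablated, opinion_dir))
```

In this toy setting, ablating the error direction zeroes the readout along it while leaving the readout along the orthogonal direction intact, which is the sense in which a signal written into an orthogonal direction would not be a relabeling of the first.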