Sycophancy Is Often a Single-Layer Phenomenon

Valentin NOËL

Published: 11 Jun 2026, Last Modified: 22 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Virtual poster
License: CC BY 4.0

Keywords: Methods (probing, steering, causal interventions), Interpretability for AI Safety, Feature Geometry

TL;DR: Sycophancy lives in the top singular direction of one MLP down_proj layer in most of the LLMs tested, and a closed-form rescaling there switches it on or off with no training, no contrastive data, and no inference overhead.

Abstract: Instruction-tuned language models often defer to user opinions even when those opinions are factually wrong, a behavior known as sycophancy. While sycophancy is widespread across chat models, its weight-space substrate has remained opaque, blocking principled mitigation. In this work, we show that sycophancy is mediated by the dominant singular subspace of a single MLP weight matrix in ten of eleven open-source models spanning 1B to 14B parameters. Compressing this spectral direction monotonically reduces the forced-choice sycophancy rate; amplifying it induces sycophancy on otherwise neutral inputs, establishing causal mediation without contrastive data. The per-layer spectral SNR does double duty: it identifies the target layer from model weights alone, and it bounds the safe operating dose ($|\alpha| \ll 1/\mathrm{SNR}_\ell$), a predictive criterion validated across model families before any behavioral evaluation. We propose a closed-form weight-space intervention that requires no training, no contrastive data, and no inference-time machinery, and that produces a strict Pareto improvement on Gemma-4-E2B-it: less sycophancy and more reasoning simultaneously. The per-layer spectral profile partitions models into three storage classes (localised, weakly localised, and distributed), predicting in advance whether single-layer surgery will succeed. Our findings reframe the alignment tax as a measurable consequence of spectral geometry, and establish spectral diagnostics as a non-behavioral audit primitive for instruction-tuning regimes.
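
The abstract does not spell out the closed-form update or the exact SNR definition, so the following is only an illustrative NumPy sketch under stated assumptions, not the paper's implementation. It assumes (i) the per-layer spectral SNR is the top singular value divided by the mean of the remaining singular values, (ii) the target layer is the one with the largest SNR, and (iii) the intervention is the rank-one update $W' = W + \alpha\,\sigma_1 u_1 v_1^\top$, so that $\alpha < 0$ compresses and $\alpha > 0$ amplifies the dominant direction. The function names `spectral_snr`, `pick_target_layer`, and `rescale_top_direction` are hypothetical.

```python
import numpy as np


def spectral_snr(W):
    """Assumed SNR definition: top singular value over the mean of the rest."""
    S = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float(S[0] / S[1:].mean())


def pick_target_layer(weights):
    """Assumed selection rule: the layer whose matrix has the largest SNR."""
    snrs = [spectral_snr(W) for W in weights]
    return int(np.argmax(snrs)), snrs


def rescale_top_direction(W, alpha):
    """Rank-one update W + alpha * sigma_1 * u_1 v_1^T.

    The top singular value becomes (1 + alpha) * sigma_1; all other
    singular values and all singular vectors are left unchanged.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return W + alpha * S[0] * np.outer(U[:, 0], Vt[0])


# Toy stand-ins for per-layer down_proj matrices (random, not real weights).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 32)) for _ in range(4)]

# Plant a strong rank-one component in layer 2 so it stands out spectrally.
u = rng.standard_normal(16)
v = rng.standard_normal(32)
layers[2] = layers[2] + 20.0 * np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))

target, snrs = pick_target_layer(layers)

# Choose a dose well inside the |alpha| << 1 / SNR bound quoted in the abstract.
# The factor 0.1 is an arbitrary illustrative safety margin.
alpha = -0.1 / snrs[target]
edited = rescale_top_direction(layers[target], alpha)

print("target layer:", target)
print("SNR per layer:", [round(s, 2) for s in snrs])
print("top singular value before/after:",
      np.linalg.svd(layers[target], compute_uv=False)[0],
      np.linalg.svd(edited, compute_uv=False)[0])
```

Because the update is folded into the stored weight matrix once, a sketch like this adds no extra computation at inference time, which is consistent with the "no inference-time machinery" claim above.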

Submission Number: 709
