Keywords: sycophancy, interpretability, alignment, LLM behavior analysis
TL;DR: We show that sycophantic agreement, genuine agreement, and sycophantic praise are distinct, independently steerable behaviors in LLMs, not a single mechanism.
Abstract: Large language models (LLMs) often exhibit sycophantic behaviors, such as excessive agreement with or flattery of the user, but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into \emph{sycophantic agreement} and \emph{sycophantic praise}, contrasting both with \emph{genuine agreement}. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
Primary Area: interpretability and explainable AI
Submission Number: 14409