Mitigating Sycophancy in Language Models via Sparse Activation Fusion and Multi-Layer Activation Steering

Published: 30 Sept 2025 · Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Keywords: Sparse Autoencoders, Steering, AI Safety
Other Keywords: Causal Interventions, Applications of interpretability, Understanding high-level properties of models, Circuit Analysis
TL;DR: We develop two methods to stop LLMs from agreeing with wrong users. SAF removes bias per query using sparse features (63%→39% sycophancy). MLAS removes pressure signals across layers (78%→0% false admits). No retraining needed.
Abstract: Instruction-tuned large language models (LLMs) often exhibit sycophancy—a tendency to agree with a user’s stated opinion even when it is factually wrong. In this work, we present two complementary inference-time interventions to mitigate this behavior using tools from mechanistic interpretability. First, we propose Sparse Activation Fusion (SAF), which addresses the prompt-dependence of sycophancy. Unlike prior methods that rely on global steering directions, SAF dynamically estimates and subtracts user-induced bias within a sparse feature space for each query. On the SycophancyEval QnA benchmark with opinion cues, SAF lowers sycophancy from 63\% to 39\% and doubles accuracy when the user’s opinion is wrong, while maintaining performance when the user is correct. Second, we introduce a multi-layer activation steering method that identifies a “pressure” direction in the residual stream—one that captures the model’s internal state when its initial answer is followed by pressure from the user to agree. By ablating this direction across targeted layers, we reduce the rate of responses in which the model falsely admits its answer was wrong from 78.0\% to 0.0\% on the SycophancyEval Trivia benchmark, while preserving baseline accuracy. Together, these methods demonstrate two effective and interpretable paths to improving LLM truthfulness without retraining. We will publicly release code, data, and other artifacts upon acceptance.
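
The abstract describes the second intervention only at a high level. As a rough illustration, the sketch below shows how a single "pressure" direction could be ablated from the residual stream at several layers using forward hooks. This is a minimal sketch under stated assumptions, not the authors' released code: it assumes a Llama-style HuggingFace module layout (model.model.layers), a precomputed unit-norm direction per layer (for example, the difference of mean activations between pressured and neutral prompts), and the helper names make_ablation_hook and ablate_direction_across_layers are hypothetical.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that removes the component of the hidden
    state along `direction` (shape [d_model], assumed unit norm and on
    the same device/dtype as the model)."""
    def hook(module, inputs, output):
        # Decoder blocks typically return a tuple whose first element is
        # the hidden states of shape [batch, seq, d_model].
        hidden = output[0] if isinstance(output, tuple) else output
        # Projection of each token's residual state onto the direction.
        proj = (hidden @ direction).unsqueeze(-1) * direction
        hidden = hidden - proj  # ablate that component
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

def ablate_direction_across_layers(model, directions: dict):
    """Register ablation hooks on the chosen layers.

    `directions` maps layer index -> unit-norm direction tensor.
    Returns the hook handles so the intervention can be removed later
    via handle.remove().
    """
    handles = []
    for layer_idx, direction in directions.items():
        block = model.model.layers[layer_idx]  # assumption: Llama-style layout
        handles.append(block.register_forward_hook(make_ablation_hook(direction)))
    return handles
```

SAF, by the abstract's description, would operate analogously but in the sparse feature space of a sparse autoencoder, estimating and subtracting a per-query user-induced bias vector rather than applying a fixed global direction; the exact estimation procedure is not specified in the abstract.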
Submission Number: 244