Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

Published: 30 Sept 2025, Last Modified: 30 Sept 2025. Mech Interp Workshop (NeurIPS 2025) Poster. License: CC BY 4.0
Keywords: Steering, Understanding high-level properties of models, Applications of interpretability
Other Keywords: Self-preference bias, Contrastive Additive Activation, Optimization-based Steering, Linear Activation Space
TL;DR: We find steering vectors that can effectively reduce an LLM's self-preference bias when it acts as a judge, although the method is unstable and risks interfering with legitimate evaluations.
Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from $\textbf{self-preference bias}$: a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight $\textbf{steering vectors}$ can mitigate this problem at inference time without retraining. We introduce a curated dataset that separates self-preference into justified and unjustified examples, and we construct steering vectors using two methods: $\textbf{Contrastive Additive Activation (CAA)}$ and an $\textbf{optimization-based approach}$. Our results show that steering vectors can reduce unjustified self-preference bias by up to $\textbf{97}$%, substantially outperforming prompting and direct preference optimization baselines. However, steering vectors are unstable on legitimate self-preference and unbiased agreement, suggesting that self-preference spans multiple directions or is encoded nonlinearly. This underscores both the promise and the limits of steering vectors as safeguards for LLM-as-judge pipelines, and it motivates more robust interventions. We make our code publicly available for reproducibility: https://anonymous.4open.science/r/steering_self_preference-EEC6
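For readers unfamiliar with contrastive activation steering, the sketch below illustrates the general recipe of extracting a direction from paired prompts and applying it at inference via a forward hook. It is a minimal illustration, not the authors' implementation (see the linked repository); the prompt sets, layer index, and scaling factor `alpha` are hypothetical placeholders, and it assumes a Hugging Face-style decoder-only causal LM.

```python
# Minimal sketch of contrastive activation steering (hypothetical names and layer choice).
import torch

@torch.no_grad()
def contrastive_steering_vector(model, tokenizer, biased_prompts, unbiased_prompts, layer):
    """Mean activation difference between two contrastive prompt sets at one layer."""
    def mean_last_token_activation(prompts):
        acts = []
        for text in prompts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            hidden_states = model(ids, output_hidden_states=True).hidden_states
            acts.append(hidden_states[layer][0, -1])  # last-token activation at chosen layer
        return torch.stack(acts).mean(dim=0)
    return mean_last_token_activation(biased_prompts) - mean_last_token_activation(unbiased_prompts)

def make_steering_hook(vector, alpha=1.0):
    """Forward hook that subtracts the scaled steering vector from a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden - alpha * vector.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Assumed usage for a decoder-only HF model (layer index and alpha are illustrative):
# vec = contrastive_steering_vector(model, tokenizer, biased_prompts, unbiased_prompts, layer=16)
# handle = model.model.layers[16].register_forward_hook(make_steering_hook(vec, alpha=2.0))
# ... run the judge prompt ...
# handle.remove()
```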
Submission Number: 281