Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

Published: 29 Sept 2025 · Last Modified: 12 Oct 2025 · NeurIPS 2025 Reliable ML Workshop · CC BY 4.0
Keywords: self-preference bias, synthetic data, preference data, synthetic preference, steering vectors, guardrails, alignment, safety
TL;DR: We find steering vectors that effectively reduce an LLM's self-preference bias when it acts as a judge, although the method is unstable and risks interfering with legitimate evaluations.
Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from self-preference bias: a tendency to favor their own outputs over those of other models. This bias hampers the trustworthiness of synthetically generated evaluation data, affecting downstream alignment tasks such as preference tuning and model routing. We introduce and evaluate a lightweight activation-based safeguard to mitigate this problem at inference time without costly retraining. We release a curated dataset that disentangles self-preference bias into valid and invalid examples of self-preference, construct steering vectors using two state-of-the-art methods, and compare our intervention against prompting and Direct Preference Optimization. We show that while our safeguard reliably deters self-preference bias in up to 97% of cases, it comes with a key limitation: a countervailing instability when applying the same vectors to legitimate evaluations. These findings underscore the need to develop lightweight tooling for reliable LLM-as-judge data, motivating future directions in robustness. We make our code publicly available at https://anonymous.4open.science/r/steering_self_preference-EEC6 for reproducibility.
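To make the intervention concrete, below is a minimal, illustrative sketch of activation steering in the difference-of-means style, where a bias direction is estimated from contrastive prompts and subtracted from the residual stream at inference time. The model name, layer index, scaling factor, and prompt sets are hypothetical placeholders and do not reproduce the paper's two construction methods or its curated dataset.

```python
# Illustrative activation-steering sketch (not the paper's exact method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model for illustration
LAYER = 6            # hypothetical intervention layer
ALPHA = -4.0         # negative scale pushes activations away from the bias direction

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_residual(texts, layer):
    """Mean residual-stream activation over the last token of each text."""
    acts = []
    for t in texts:
        ids = tokenizer(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so index layer + 1
        # corresponds to the output of transformer block `layer`.
        acts.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(acts).mean(dim=0)

# Hypothetical contrastive prompt sets: judgments that favor the model's own
# output versus judgments that pick the better answer regardless of authorship.
biased_texts  = ["Response A (my own) is better because I wrote it."]
neutral_texts = ["Response B is better because it answers the question correctly."]

steering_vec = mean_residual(biased_texts, LAYER) - mean_residual(neutral_texts, LAYER)
steering_vec = steering_vec / steering_vec.norm()

def steer_hook(module, inputs, output):
    # Add the (negatively scaled) bias direction to every position's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steering_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
prompt = "Which response is better, A or B? Answer:"
ids = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The same hook mechanism also illustrates the limitation discussed in the abstract: because the vector is added unconditionally at the chosen layer, it can perturb legitimate evaluations as well as biased ones.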
Submission Number: 94