Breaking the Mirror: Examining Self-Preference in LLM Evaluators through Activation-Based Representations
Keywords: LLM-as-a-judge, self-preference bias, activation steering, steering vectors, interpretability, representational entanglement, bias mitigation, synthetic data
TL;DR: We use activation steering to find and remove self-preference bias in LLM evaluators, but unstable performance on legitimate evaluations suggests representational entanglement.
Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from self-preference bias: a tendency to favor their own outputs over those of other models. This bias undermines the trustworthiness of synthetically generated evaluation data. We therefore propose a methodology based on activation steering to modulate the internal representation of self-preference bias. We release a curated dataset that disentangles this bias into valid and invalid examples of self-preference, construct steering vectors using two state-of-the-art methods, and compare our intervention against prompting and Direct Preference Optimization. Although our approach identifies linear representations of self-preference bias---changing outcomes in up to 97% of biased cases---it comes with a key limitation: a countervailing instability when the same vectors are applied to legitimate evaluations. These findings highlight the need to isolate self-preference biases in LLM-as-judge evaluations, motivating future directions in synthetic evaluation data.
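For readers unfamiliar with the intervention, the sketch below illustrates the general idea of additive activation steering (adding a fixed direction to a layer's hidden states at inference time). It is a minimal illustration only: the model name, layer index, coefficient, and random vector are placeholder assumptions and do not reflect the paper's actual steering vectors, judge model, or configuration.

```python
# Minimal sketch of additive activation steering with a PyTorch forward hook.
# All concrete choices (model, layer, coefficient) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; not the evaluator model used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # hypothetical intervention layer
alpha = -4.0    # negative coefficient to suppress the bias direction
hidden_size = model.config.hidden_size

# A unit-norm steering vector; in practice this direction would be derived from
# contrastive activations (e.g., biased vs. unbiased judgments), not random noise.
steering_vector = torch.randn(hidden_size)
steering_vector = steering_vector / steering_vector.norm()

def steer_hook(module, inputs, output):
    # GPT-2 transformer blocks return a tuple whose first element is the hidden
    # states; add the scaled steering vector at every token position.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)

prompt = "Which of the two responses is better? Response A: ... Response B: ..."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```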
Submission Number: 131