Breaking the Mirror: Examining Self-Preference in LLM Evaluators through Activation-Based Representations
Keywords: LLM-as-a-judge, self-preference bias, activation steering, steering vectors, interpretability, representational entanglement, bias mitigation, synthetic data
TL;DR: We use activation steering to find and remove self-preference bias in LLM evaluators, but unstable performance on legitimate evaluations suggests representational entanglement.
Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from self-preference bias: a tendency to favor their own outputs over those of other models. This bias undermines the trustworthiness of synthetically generated evaluation data. We therefore propose a methodology based on activation steering to modulate the internal representation of self-preference bias. We release a curated dataset that disentangles this bias into valid and invalid examples of self-preference, construct steering vectors using two state-of-the-art methods, and compare our intervention against prompting and Direct Preference Optimization. Although our approach identifies linear representations of self-preference bias---changing outcomes in up to 97% of biased cases---it comes with a key limitation: a countervailing instability when the same vectors are applied to legitimate evaluations. These findings highlight the need to isolate self-preference biases in LLM-as-judge evaluations, motivating future directions in synthetic evaluation data.
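For readers unfamiliar with the intervention, the sketch below illustrates the general idea of additive activation steering (adding a fixed direction to a layer's hidden states at inference time). It is a minimal illustration only: the model name, layer index, coefficient, and random vector are placeholder assumptions and do not reflect the paper's actual steering vectors, judge model, or configuration.

```python
# Minimal sketch of additive activation steering with a PyTorch forward hook.
# All concrete choices (model, layer, coefficient) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; not the evaluator model used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # hypothetical intervention layer
alpha = -4.0    # negative coefficient to suppress the bias direction
hidden_size = model.config.hidden_size

# A unit-norm steering vector; in practice this direction would be derived from
# contrastive activations (e.g., biased vs. unbiased judgments), not random noise.
steering_vector = torch.randn(hidden_size)
steering_vector = steering_vector / steering_vector.norm()

def steer_hook(module, inputs, output):
    # GPT-2 transformer blocks return a tuple whose first element is the hidden
    # states; add the scaled steering vector at every token position.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)

prompt = "Which of the two responses is better? Response A: ... Response B: ..."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```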
Submission Number: 131