TL;DR: Experiments on self-preference bias confound the accuracy of a model on a task with the accuracy of that same model when evaluating the task. Eliminating this confound substantially reduces reported self-preference.
Abstract: Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows.
However, it is difficult to disentangle which evaluator behaviors reflect narcissism and which stem from experimental confounds.
Specifically, LLM evaluators may deliver self-preferring verdicts when comparing responses to questions they themselves answer incorrectly; these verdicts may depend not on the identity of the author but on the quality of the evaluator.
We correct this by directly comparing the judge's voting distribution in cases where it evaluates itself versus another model.
Using this evaluator-quality baseline as the null hypothesis, we find that only **51%** of examples in previous findings retain statistical significance, covering **89.6%** of total self-preference probability mass.
Finally, we compare the entropy of voting distributions, suggesting uncertainty-driven overlap, and show that our procedure enables more careful documentation against the backdrop of judge-bias research.
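As a rough illustration of the entropy comparison mentioned above, the sketch below computes the Shannon entropy of a judge's vote distribution over repeated verdicts on one comparison. The function name, vote labels, and sample votes are hypothetical and are not taken from the paper's code; the paper's actual entropy analysis may be set up differently.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Shannon entropy (in bits) of a list of categorical votes.

    `votes` is a hypothetical list of verdict labels, e.g. ["A", "B", "A"].
    Higher entropy means the judge's verdicts are closer to uniform noise.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical example: a confident judge versus an uncertain one.
confident = ["A"] * 10          # always picks response A
uncertain = ["A", "B"] * 5      # splits evenly between A and B

print(vote_entropy(confident))  # zero bits: no uncertainty
print(vote_entropy(uncertain))  # one bit: maximal for two options
```

Under this reading, a judge whose votes are near one bit of entropy on a two-way comparison is behaving close to a coin flip, regardless of who authored the candidates.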
Lay Summary: Large language models (LLMs) can evaluate text at scale according to user-defined criteria — a capability widely used to benchmark other models. These setups, known as LLM-as-a-judge, have become standard in assessing and improving the capabilities of models on subjective tasks such as communication or creative writing.
Like human raters, LLM judges can exhibit systematic biases. A well-documented example is *self-preference*: models tend to rate outputs from the same model family more favorably, even without knowing they are doing so. This raises obvious concerns for evaluation integrity.
We argue that most reported self-preference is an artifact of experimental design rather than genuine narcissism. To show this, we make three observations.
1. **Self-preference is concentrated on tasks the evaluator fails at.** When an LLM judge lacks the competence to assess a task (say, evaluating machine translation into a language it covers poorly), its verdicts become unreliable. In these regimes it tends to favor its own outputs, but this reflects poor judgment rather than identity-based bias.
2. **Poor evaluations are misattributed to narcissism.** When a judge is forced to choose between two responses to a task it cannot reliably assess, its verdict is effectively noise — regardless of whether the chosen response happens to come from itself or another model.
3. **True self-preference requires a specific signature.** If narcissism is real, a judge should make worse decisions *specifically* when one of the candidates is its own output, compared to choosing between two external options. We find this signature is absent in **89.6%** of reported instances — meaning the vast majority of what has been labeled self-preference is better explained by undifferentiated noise from low-competence judgments.
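The signature in observation 3 can be sketched as a comparison of two error rates: how often the judge picks the worse response when its own output is a candidate, versus when both candidates come from other models. The example below uses a one-sided two-proportion z-test built on the standard library. The function name and counts are hypothetical, and the test itself is an assumption chosen for illustration; the paper's own statistical procedure may differ.

```python
import math

def two_proportion_z(err_self, n_self, err_other, n_other):
    """One-sided two-proportion z-test.

    Tests whether the judge's error rate is higher when its own output is
    a candidate (err_self / n_self) than when both candidates are external
    (err_other / n_other). Returns (z, p_value) with p_value = P(Z >= z).
    """
    p_self = err_self / n_self
    p_other = err_other / n_other
    pooled = (err_self + err_other) / (n_self + n_other)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_self + 1 / n_other))
    if se == 0:
        # Both groups are all-correct or all-wrong: no detectable difference.
        return 0.0, 0.5
    z = (p_self - p_other) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of N(0, 1)
    return z, p_value

# Hypothetical counts: equal error rates give no evidence of narcissism.
z, p = two_proportion_z(err_self=50, n_self=100, err_other=50, n_other=100)
print(z, p)

# Hypothetical counts: more errors when the judge's own output is involved.
z, p = two_proportion_z(err_self=70, n_self=100, err_other=50, n_other=100)
print(z, p)
```

If the judge is equally error-prone in both settings, the excess self-preference is attributable to evaluator quality rather than authorship; only a significantly higher error rate in the self-involved setting would match the signature described above.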
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/djroytburg/sanity_checks_for_self_preference
Primary Area: Social Aspects->Alignment
Keywords: self-preference, llm-as-judge, situational awareness, meta-evaluation, scalable oversight, evaluation noise, reliability of llms
Originally Submitted PDF: pdf
Submission Number: 13783