Keywords: Large Language Models, Evaluators, LLM-as-a-Judge
TL;DR: We use verifiable benchmarks to examine whether LLM evaluators’ tendency to prefer their own outputs reflects legitimate preference or harmful bias.
Abstract: Large language models (LLMs) are increasingly used as automatic evaluators in applications such as benchmarking, reward modeling, and self-refinement.
Prior work highlights a potential *self-preference* bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability.
This raises a critical question:
Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models?
Answering this has been difficult because previous studies relied primarily on subjective tasks.
These tasks lack an objective ground truth, meaning that either preference can be reasonably justified.
To address this ambiguity, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment.
This enables us to distinguish *harmful* self-preference (favoring objectively worse responses) from *legitimate* self-preference (favoring genuinely superior ones).
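For concreteness, one way to operationalize this distinction on a verifiable benchmark could look like the following minimal sketch (the function and class names are hypothetical illustrations, not the paper's implementation):

```python
from enum import Enum

class PreferenceOutcome(Enum):
    LEGITIMATE = "legitimate"   # evaluator prefers its own, objectively correct, answer
    HARMFUL = "harmful"         # evaluator prefers its own, objectively wrong, answer
    NO_SELF_PREFERENCE = "none" # evaluator prefers the other model's answer
    UNRESOLVED = "unresolved"   # both answers correct or both wrong; ground truth cannot adjudicate

def classify_self_preference(prefers_own: bool, own_correct: bool, other_correct: bool) -> PreferenceOutcome:
    """Classify one pairwise judgment on a verifiable benchmark item.

    prefers_own:   the evaluator picked its own response over the competitor's
    own_correct:   the evaluator's own response matches the ground truth
    other_correct: the competitor's response matches the ground truth
    """
    if not prefers_own:
        return PreferenceOutcome.NO_SELF_PREFERENCE
    # Preferring one's own correct answer over an incorrect one is legitimate;
    # preferring one's own incorrect answer over a correct one is harmful bias.
    if own_correct and not other_correct:
        return PreferenceOutcome.LEGITIMATE
    if other_correct and not own_correct:
        return PreferenceOutcome.HARMFUL
    return PreferenceOutcome.UNRESOLVED
```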
We conduct large-scale experiments under controlled evaluation conditions across diverse model families (*e.g.*, Llama, Qwen, Gemma, Mistral, Phi, GPT, DeepSeek).
Our findings reveal three key insights:
(1) While stronger models exhibit greater self-preference, much of this preference aligns with objectively superior performance, indicating that their self-preference is largely legitimate.
(2) Harmful self-preference persists when evaluator models err as generators, and stronger models display more pronounced harmful self-preference when they do err.
This suggests stronger models struggle more to recognize when they are wrong.
(3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce harmful self-preference.
These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6486