Mitigating Self-Preference by Authorship Obfuscation

Published: 24 Sept 2025, Last Modified: 24 Sept 2025. NeurIPS 2025 LLM Evaluation Workshop Poster. License: CC BY 4.0
Keywords: AI-Alignment, LM-as-a-judge, bias
Abstract: Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite their advantages, LM judges display concerning biases, notably self-preference—preferring their own answers over those from other LMs or humans, even when the alternative is objectively better. Following the self-recognition hypothesis, we apply black-box perturbations to obfuscate authorship in pairwise comparisons, aiming to reduce harmful self-preference. Simple synonym replacement for a few words reduces bias, but eliminating all stylistic cues via paraphrasing can reverse the effect, revealing that self-preference operates on multiple semantic levels. These findings highlight both the promise and the challenge of mitigating bias in LM judges.
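The synonym-replacement perturbation described in the abstract can be sketched roughly as follows. This is an illustrative assumption, not the paper's actual implementation: the synonym table, the number of replaced words, and the selection strategy here are all hypothetical stand-ins for whatever the authors used.

```python
import random

# Toy synonym table (hypothetical; the paper does not specify its
# synonym source).
SYNONYMS = {
    "big": ["large", "sizable"],
    "quick": ["fast", "rapid"],
    "answer": ["response", "reply"],
    "good": ["solid", "sound"],
}


def obfuscate(text: str, k: int = 3, seed: int = 0) -> str:
    """Replace up to k words with synonyms to mask stylistic authorship cues."""
    rng = random.Random(seed)
    words = text.split()
    # Positions of words we know synonyms for.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(k, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)
```

In a pairwise comparison, both candidate answers would be passed through such a perturbation before being shown to the judge, so that surface-level wording no longer signals which answer the judge itself wrote.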
Submission Number: 9