Keywords: LLM-as-a-judge, Bradley-Terry model, position bias, self-consistency, judge evaluation
Abstract: Large Language Model judges are widely used to rank texts and
text-generating systems through pairwise comparison, and their reliability
is typically assessed via three proxies: position bias, transitivity, and
pairwise agreement (self- or human-labeled). Because these proxies drive
judge selection and benchmarking, a substantial literature reporting that
judges perform poorly on them risks steering practitioners away from
otherwise capable evaluators. We argue this assessment is misleading.
Under the Bradley-Terry geometry underlying pairwise aggregation, each
proxy is dominated by close-rank-gap pairs, where inconsistency is
information-theoretically expected and individual verdicts contribute
little to the aggregate ranking; far-gap pairs carry the ranking signal
but barely move the proxies. We formalize this argument and validate it in
a controlled simulation and on two human-rated corpora: the proxies
correlate only weakly with ranking accuracy against gold, and their
predictive component concentrates in the far-gap regime. Judges should
therefore be assessed on rank-gap-conditional metrics, ideally against
human rankings. Code is provided in the supplementary zip archive.
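To make the close-gap argument concrete, here is a minimal sketch of how a Bradley-Terry judge's expected self-agreement varies with the latent score gap. It is an illustration under stated assumptions, not the paper's formalization or released code: the function names are hypothetical, the gap is measured in logit units, and repeated verdicts on the same pair are assumed to be independent draws.

```python
import math


def bt_win_prob(gap: float) -> float:
    """Bradley-Terry probability that the higher-scored item wins.

    `gap` is the latent score difference in logit units (an assumption of
    this sketch; the paper may parameterize rank gaps differently).
    """
    return 1.0 / (1.0 + math.exp(-gap))


def expected_self_agreement(gap: float) -> float:
    """Probability that two independent verdicts on the same pair agree."""
    p = bt_win_prob(gap)
    return p * p + (1.0 - p) * (1.0 - p)


def verdict_entropy_bits(gap: float) -> float:
    """Shannon entropy (in bits) of a single verdict at the given gap."""
    p = bt_win_prob(gap)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


if __name__ == "__main__":
    # Close gaps: agreement near chance and high entropy.
    # Far gaps: agreement approaches 1 and entropy approaches 0.
    for gap in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(
            f"gap={gap:.1f}  p_win={bt_win_prob(gap):.3f}  "
            f"agreement={expected_self_agreement(gap):.3f}  "
            f"entropy={verdict_entropy_bits(gap):.3f} bits"
        )
```

Under these assumptions, a pair with zero gap yields agreement at chance level even for a judge that follows the model perfectly, which is why a proxy averaged mostly over close-gap pairs can look poor without implying a poor aggregate ranking.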
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration/uncertainty, hardness of samples, robustness
Contribution Types: Model analysis & interpretability, Data analysis, Theory
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 16098