Pair Difficulty Matters: Rethinking Pairwise LLM-as-a-Judge Evaluation and Consistency

ACL ARR 2026 May Submission 16098 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, Readers: Everyone, License: CC BY 4.0

Keywords: LLM-as-a-judge, Bradley-Terry model, position bias, self-consistency, judge evaluation

Abstract: Large language model (LLM) judges are widely used to rank texts and text-generating systems through pairwise comparison, and their reliability is typically assessed via three proxies: position bias, transitivity, and pairwise agreement (self-agreement or agreement with human labels). Because these proxies drive judge selection and benchmarking, a substantial literature reporting that judges perform poorly on them risks steering practitioners away from otherwise capable evaluators. We argue that this assessment is misleading. Under the Bradley-Terry geometry underlying pairwise aggregation, each proxy is dominated by close-rank-gap pairs, where inconsistency is information-theoretically expected and individual verdicts contribute little to the aggregate ranking; far-gap pairs carry the ranking signal but barely move the proxies. We formalize this argument and validate it in a controlled simulation and on two human-rated corpora: the proxies correlate only weakly with ranking accuracy against gold rankings, and their predictive component concentrates in the far-gap regime. Judges should therefore be assessed on rank-gap-conditional metrics, ideally against human rankings. Code is provided in the supplementary zip archive.
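The close-gap versus far-gap intuition in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes the standard Bradley-Terry form with latent log-strengths on a unit logistic scale, and it assumes that repeated verdicts on the same pair are independent draws. The function names and the example gap values are hypothetical.

```python
import math

def bt_win_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that item A beats item B.

    Assumption: scores are latent log-strengths, so the win probability
    is a logistic function of the score gap.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def expected_self_agreement(p: float) -> float:
    """Probability that two independent verdicts on the same pair agree."""
    return p * p + (1.0 - p) * (1.0 - p)

def binary_entropy(p: float) -> float:
    """Entropy in bits of a single verdict with win probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

# Hypothetical score gaps, from a tie (0.0) to a clearly separated pair (4.0).
gaps = [0.0, 0.5, 1.0, 2.0, 4.0]
for gap in gaps:
    p = bt_win_prob(gap, 0.0)
    print(
        f"gap={gap:.1f}  p_win={p:.3f}  "
        f"self_agreement={expected_self_agreement(p):.3f}  "
        f"entropy_bits={binary_entropy(p):.3f}"
    )
```

Under these assumptions, a pair with zero gap yields chance-level self-agreement even for a judge that follows the model exactly, while agreement approaches 1 as the gap widens. This is the sense in which aggregate consistency proxies can be dominated by close-gap pairs.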

Paper Type: Short

Research Area: Interpretability and Analysis of Models for NLP

Research Area Keywords: calibration/uncertainty, hardness of samples, robustness

Contribution Types: Model analysis & interpretability, Data analysis, Theory

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: yes

Submission Number: 16098
