Robust LLM-Based Scoring via Reference-Anchored ELO Estimation

Jaehun Song; Soonhwang Choi

15 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Large Language Model, LLM Evaluation, LLM Judge, Evaluation

TL;DR: We introduce Reference-Anchored Elo Estimation (RAEE), a framework that converts simple pairwise comparisons against a fixed reference into stable, reproducible scores with built-in uncertainty, overcoming the instabilities of direct LLM evaluation.

Abstract: LLM-based evaluation by direct (absolute) scoring suffers from systemic instabilities: ceiling compression constrains headroom, heavy-tailed score distributions inflate variance, and inconsistent agreement across independently trained judges induces scale drift that destabilizes rankings. We present *Reference-Anchored Elo Estimation* (RAEE), a principled framework that anchors all model comparisons to a fixed reference and expresses outcomes as win probabilities on a relative scale rather than as absolute scores. We prove that, by design, RAEE minimizes judge-specific scale drift, suppresses between-judge variation, and yields analytic uncertainty estimates without costly resampling. Experimental results show that RAEE reduces per-run standard error by $\approx 44$\% and across-judge coefficient of variation by $\approx 72$\% relative to direct scoring, while preserving ranking stability even under reference changes. The method is robust across multiple domains, sustaining low dispersion and consistent rankings despite task-specific difficulty shifts. Our analytic uncertainty bounds, which incorporate finite-population and reliability adjustments, predict observed variance within $\pm 12$\% on tested datasets. These results position RAEE as a statistically efficient, reproducible, and readily deployable alternative to conventional LLM-based evaluation.
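The core idea of anchoring to a fixed reference can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes the standard Elo logistic mapping (scale 400, base 10), a binomial win model, and a delta-method standard error. The function name `anchored_elo`, the reference rating of 1000, and the add-half smoothing are illustrative assumptions, and the finite-population and reliability adjustments mentioned in the abstract are omitted.

```python
import math


def anchored_elo(wins: int, n: int, ref_rating: float = 1000.0) -> tuple[float, float]:
    """Elo-scale score and analytic standard error from wins against one fixed reference.

    Hypothetical helper for illustration only; not the RAEE estimator itself.
    """
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")

    # Add-half smoothing (an assumption) keeps p strictly inside (0, 1),
    # so the log-odds below are always finite.
    p = (wins + 0.5) / (n + 1.0)

    # Standard Elo mapping: rating gap = 400 * log10(odds of beating the reference).
    rating = ref_rating + 400.0 * math.log10(p / (1.0 - p))

    # Delta method: Var(p_hat) ~ p(1-p)/n and d(rating)/dp = 400 / (ln 10 * p * (1-p)),
    # which gives SE(rating) ~ 400 / (ln 10 * sqrt(n * p * (1-p))).
    se = (400.0 / math.log(10)) / math.sqrt(n * p * (1.0 - p))
    return rating, se


if __name__ == "__main__":
    rating, se = anchored_elo(wins=62, n=100)
    print(f"rating = {rating:.1f}, standard error = {se:.1f}")
```

Because every candidate is compared against the same reference, scores from different runs share one scale, and the standard error follows in closed form from the comparison count without resampling.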

Primary Area: datasets and benchmarks

Submission Number: 5449
