Keywords: Large Language Model, LLM Evaluation, LLM Judge, Evaluation
TL;DR: We introduce Reference-Anchored Elo Estimation (RAEE), a framework that converts simple pairwise comparisons against a fixed reference into stable, reproducible scores with built-in uncertainty, overcoming the instabilities of direct LLM evaluation.
Abstract: LLM-based evaluation by direct (absolute) scoring suffers from systemic instabilities: ceiling compression constrains headroom, heavy-tailed score distributions inflate variance, and inconsistent agreement across independently trained judges induces scale drift that destabilizes rankings.
We present *Reference-Anchored Elo Estimation* (RAEE), a principled framework that replaces absolute scoring by anchoring all model comparisons to a fixed reference and expressing outcomes as win probabilities on a relative scale.
We prove that, by design, RAEE minimizes judge-specific scale drift, suppresses between-judge variation, and yields analytic uncertainty estimates without costly resampling.
Experimental results show that RAEE reduces per-run standard error by $\approx 44$\% and across-judge coefficient of variation by $\approx 72$\% relative to direct scoring, while preserving ranking stability even under reference changes.
Robustness is observed across multiple domains, with RAEE sustaining low dispersion and consistent rankings despite task-specific difficulty shifts.
Our analytic uncertainty bounds, which incorporate finite-population and reliability adjustments, predict observed variance within $\pm 12$\% on tested datasets.
These results position RAEE as a statistically efficient, reproducible, and readily deployable alternative to conventional LLM-based evaluation.
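As a rough illustration of the reference-anchored idea described in the abstract, the sketch below converts a win count against a fixed reference into an Elo-style gap with a closed-form standard error. It is not the paper's estimator: the function name `elo_vs_reference`, the standard 400-point logistic Elo scale, the half-win treatment of ties, the clipping rule, and the delta-method standard error are all assumptions made for illustration, and the finite-population and reliability adjustments mentioned in the abstract are omitted.

```python
import math


def elo_vs_reference(wins, n, ties=0):
    """Elo-style gap to a fixed reference, with a delta-method standard error.

    Hypothetical helper for illustration only; not the RAEE estimator itself.
    wins: comparisons the candidate won against the reference
    n:    total comparisons (wins + ties + losses)
    ties: comparisons judged as ties, counted as half a win (an assumption)
    """
    if n <= 0:
        raise ValueError("n must be positive")

    # Empirical win probability against the reference.
    p = (wins + 0.5 * ties) / n

    # Keep p away from 0 and 1 so the logit stays finite (an assumption).
    eps = 0.5 / n
    p = min(max(p, eps), 1.0 - eps)

    # Standard Elo link: p = 1 / (1 + 10 ** (-gap / 400)).
    scale = 400.0 / math.log(10.0)
    gap = scale * math.log(p / (1.0 - p))

    # Delta method on a binomial proportion: no resampling needed.
    se = scale / math.sqrt(n * p * (1.0 - p))
    return gap, se


if __name__ == "__main__":
    gap, se = elo_vs_reference(wins=75, n=100)
    print(f"Elo gap vs reference: {gap:.1f} +/- {1.96 * se:.1f} (95% interval)")
```

Because every model is scored against the same reference, the resulting gaps share one scale, and the interval comes directly from the comparison count rather than from bootstrap resampling.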
Primary Area: datasets and benchmarks
Submission Number: 5449