Keywords: reliability, evaluations, generalizability theory, rankings, leaderboards
Abstract: Agent benchmark rankings drive deployment decisions, model release claims, and policy citations, but their reliability has gone largely unmeasured. We apply a generalizability-theoretic framework to seven benchmarks on the Holistic Agent Leaderboard (HAL), decomposing score variance across model, task, and scaffold and quantifying, for the first time, the effect of scaffold on benchmark rank stability. We estimate ranking reliability at two levels of measurement, the model and the model-scaffold pair, and show that the two levels yield divergent reliabilities on the same data. Within a single benchmark, model-ranking reliability asymptotes near $E\hat{\rho}^2 \approx 0.52$ under item-only scaling, well below conventional thresholds; broadening across the seven HAL benchmarks raises the projected ceiling to $E\hat{\rho}^2 \approx 0.88$. The ceiling is therefore design-conditional: the model-scaffold, benchmark-model, and benchmark-model-scaffold interactions attenuate with scaffold and benchmark count, not with item count, so reliable model rankings require evaluation breadth rather than benchmark size. We re-rank all seven HAL leaderboards and find substantial reordering; for example, on $\tau$-bench Airline, the model published as best drops 16 ranks once scaffold and model contributions are separated. We close with four principles for designing, reporting, and interpreting agent evaluations: declare the object of measurement, attribute improvements to the factor that varied, report reliability with uncertainty, and diversify contexts rather than only task counts.
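The item-only ceiling described in the abstract can be illustrated with a minimal sketch. It assumes a textbook fully crossed models x scaffolds x tasks design with a relative-error generalizability coefficient; the paper's actual design (which also includes a benchmark facet) may differ. The function names and the variance components below are hypothetical and chosen only for illustration; they are not estimates from the paper.

```python
def g_coefficient(var_model, var_model_scaffold, var_model_task,
                  var_residual, n_scaffolds, n_tasks):
    """Relative-error G coefficient for ranking models in a crossed
    models x scaffolds x tasks design (a standard textbook form, assumed here)."""
    relative_error = (
        var_model_scaffold / n_scaffolds            # shrinks only with more scaffolds
        + var_model_task / n_tasks                  # shrinks with more tasks
        + var_residual / (n_scaffolds * n_tasks)    # shrinks with either
    )
    return var_model / (var_model + relative_error)


def item_only_ceiling(var_model, var_model_scaffold, n_scaffolds):
    """Limit of the G coefficient as n_tasks grows without bound."""
    return var_model / (var_model + var_model_scaffold / n_scaffolds)


# Hypothetical variance components (illustrative only, not from the paper).
components = dict(var_model=1.0, var_model_scaffold=1.0,
                  var_model_task=2.0, var_residual=4.0)

for n_tasks in (10, 100, 10_000):
    rho2 = g_coefficient(**components, n_scaffolds=1, n_tasks=n_tasks)
    print(f"1 scaffold, {n_tasks:>6} tasks: E rho^2 = {rho2:.3f}")

# Adding tasks cannot push reliability past this value with a single scaffold.
print("ceiling with 1 scaffold :", item_only_ceiling(1.0, 1.0, 1))
# Adding scaffolds raises the ceiling itself.
print("ceiling with 4 scaffolds:", item_only_ceiling(1.0, 1.0, 4))
```

In this sketch the model-scaffold interaction term is divided only by the number of scaffolds, so it survives any increase in task count. That is the mechanism behind a design-conditional ceiling: more items tighten the estimate toward the ceiling, while more scaffolds (or, in the paper's fuller design, more benchmarks) move the ceiling.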
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 20