How Benchmark Prediction from Fewer Data Misses the Mark

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: Evaluation, Large language model
Abstract: Evaluating large language models (LLMs) is increasingly costly, motivating methods to speed up evaluation by compressing benchmark datasets. Benchmark prediction aims to select a small subset of evaluation points and predict overall performance from that subset. We systematically assess 11 benchmark prediction methods across 19 benchmarks. First, we identify a strong baseline: sample points at random and fit a regression to predict the missing entries. This baseline outperforms most existing methods and challenges the need for careful subset selection. Second, we show that all methods rely on model similarity: performance degrades markedly when extrapolating to stronger models than those used for training, where few methods beat a simple sample average. We introduce an augmented inverse propensity weighting (AIPW) estimator that consistently improves over the random sample average under both interpolation and extrapolation, though gains remain modest and still depend on similarity. This shows that benchmark prediction fails just when it is most needed: at the evaluation frontier, where the goal is to evaluate new models of unknown capabilities.
Submission Number: 31
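The sketch below illustrates, under simplifying assumptions, the three estimators the abstract contrasts: a plain sample average over a random subset, the regression-imputation baseline, and a textbook AIPW estimator (regression predictions everywhere plus an inverse-propensity-weighted correction on the evaluated points). The simulated data, variable names, and use of a linear regression are hypothetical choices for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical setup: per-example scores of previously evaluated "source"
# models, plus the (normally unknown) scores of a new target model.
n_examples, n_source_models = 2000, 10
source_scores = rng.random((n_examples, n_source_models))
noise = 0.1 * rng.standard_normal(n_examples)
target_scores = ((source_scores.mean(axis=1) + noise) > 0.5).astype(float)
true_mean = target_scores.mean()  # quantity we want to estimate

# Evaluate the target model only on a uniform random subset of examples.
n_sample = 200
pi = n_sample / n_examples                      # inclusion probability
sampled = rng.choice(n_examples, size=n_sample, replace=False)
observed = np.zeros(n_examples, dtype=bool)
observed[sampled] = True

# 1) Simple sample average over the evaluated subset.
sample_avg = target_scores[sampled].mean()

# 2) Regression baseline: fit on the sampled examples
#    (target score ~ source-model scores), then average predictions
#    over the full benchmark, i.e. impute the missing entries.
reg = LinearRegression().fit(source_scores[sampled], target_scores[sampled])
preds = reg.predict(source_scores)
regression_estimate = preds.mean()

# 3) Textbook AIPW: predictions everywhere, plus an inverse-propensity
#    correction on observed examples (zero elsewhere). In practice the
#    target scores are only available on the sampled subset.
correction = observed * (target_scores - preds) / pi
aipw_estimate = (preds + correction).mean()

print(f"true mean       {true_mean:.4f}")
print(f"sample average  {sample_avg:.4f}")
print(f"regression      {regression_estimate:.4f}")
print(f"AIPW            {aipw_estimate:.4f}")
```

In this form, the AIPW correction term keeps the estimator unbiased under uniform random sampling even when the regression is misspecified, while the regression term reduces variance when it predicts well; how closely this matches the paper's estimator is an assumption here.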