Keywords: Evaluation, Efficiency, Efficient Evaluation, Adaptive Testing, Adaptive Evaluation, Sequential Testing
TL;DR: An adaptive, assumption-light evaluation method that yields provable statistical guarantees and cuts evaluation cost by stopping early.
Abstract: The inherent rigidity of fixed-size benchmarks makes them an inefficient tool for model evaluation. Diverse evaluation objectives, including model ranking, model selection, and testing throughout development, demand varying levels of statistical power. The mismatch between fixed sample sizes and these diverse needs results in either excessive computational cost or compromised reliability, a critical concern for model evaluation. To overcome these limitations, we call for the adoption of sequential testing in our field. We introduce an adaptive evaluation framework that offers a principled way to navigate the trade-off between efficiency and reliability in model evaluation. Our framework combines the established statistical paradigm of sequential testing with stopping criteria tailored to common evaluation needs, such as detecting diminishing returns and enforcing a minimum detectable effect size. We demonstrate its ability to adaptively manage the efficiency-reliability trade-off on the Open VLM Leaderboard, achieving, for example, an 80% reduction in computational cost compared to fixed-size evaluation (with a 2.5-point CI width allowance) while maintaining statistical significance.
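To make the stopping-rule idea concrete, here is a minimal Python sketch (not the authors' implementation) of a CI-width stopping criterion: evaluation halts once the width of the interval around the running accuracy estimate drops below an allowance such as 2.5 points. The `model`, `benchmark`, and item structure are hypothetical, and the fixed-z normal-approximation interval shown here ignores the repeated-looks problem; a proper sequential procedure of the kind the abstract advocates would use an anytime-valid confidence sequence instead.

```python
import math
import random

def evaluate_sequentially(model, benchmark, ci_allowance=0.025, z=1.96,
                          min_samples=100, max_samples=None):
    """Evaluate `model` on shuffled `benchmark` items, stopping early once the
    normal-approximation CI width of the accuracy estimate falls below
    `ci_allowance` (e.g. 0.025 = 2.5 points). Returns (accuracy, n_used)."""
    items = list(benchmark)
    random.shuffle(items)  # randomize order so early stopping is not biased by item ordering
    if max_samples is None:
        max_samples = len(items)

    correct = 0
    for n, item in enumerate(items[:max_samples], start=1):
        # `item["answer"]` and the callable `model` are hypothetical placeholders.
        correct += int(model(item) == item["answer"])
        if n < min_samples:
            continue  # require a minimum sample before checking the stopping rule
        p = correct / n
        half_width = z * math.sqrt(p * (1 - p) / n)
        if 2 * half_width <= ci_allowance:  # interval narrow enough: stop early
            return p, n
    return correct / max_samples, max_samples
```

Under this sketch, a model whose accuracy stabilizes quickly would stop well before exhausting the benchmark, which is the source of the cost savings the abstract reports; the guarantee only becomes statistically rigorous when the interval is replaced by one that remains valid under continuous monitoring.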
Primary Area: datasets and benchmarks
Submission Number: 9198