Efficient Benchmarking via Bias-Bounded Subset Selection

Yan Zhuang, Junhao Yu, Qi Liu, Yuxuan Sun, Jiatong Li, Zhenya Huang, Enhong Chen

Published: 01 Jan 2025, Last Modified: 14 Jan 2026 · IEEE Transactions on Pattern Analysis and Machine Intelligence · CC BY-SA 4.0
Abstract: Evaluating AI systems, particularly large models, is an essential yet computationally expensive task. Extensive benchmarks often incur substantial computational and human costs that can even exceed those of pretraining. Efficient evaluation aims to estimate a model's score on the full benchmark from its responses to a smaller subset, and various empirical selection methods have been proposed to identify valuable subsets within these benchmarks. In this paper, we formally define and approximate the subset selection problem inherent in efficient evaluation. We prove that this problem optimizes a submodular function and that a unified subset can be identified with a simple greedy algorithm. Importantly, this approach is the first to provide theoretical guarantees of bias control and generalizability in score estimation. Using language models as a case study, experimental results across 11 benchmarks validate its superiority in estimating model scores and maintaining ranking consistency. It achieves accurate score estimation using no more than 30% of the full benchmark, thus facilitating efficient and sparse benchmark design.
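The abstract does not spell out the paper's objective function, but greedy maximization of a monotone submodular set function is a standard template and carries the classical (1 − 1/e) approximation guarantee (Nemhauser, Wolsey, and Fisher, 1978). The sketch below is purely illustrative: it assumes a facility-location objective f(S) = Σ_i max_{j∈S} sim(i, j) over pairwise item similarities as a stand-in for the paper's bias-bounded criterion, and the name `greedy_subset`, its parameters, and the embeddings and responses in the usage example are all hypothetical, not the authors' method or API.

```python
import numpy as np

def greedy_subset(similarity: np.ndarray, budget: int) -> list[int]:
    """Greedy maximization of the facility-location objective
    f(S) = sum_i max_{j in S} similarity[i, j], a standard monotone
    submodular function (an illustrative stand-in for the paper's
    bias-bounded objective, which the abstract does not specify)."""
    n = similarity.shape[0]
    selected: list[int] = []
    best = np.zeros(n)                   # best[i]: coverage of item i by S
    available = np.ones(n, dtype=bool)
    for _ in range(budget):
        # Coverage after adding candidate j: sum_i max(best[i], sim[i, j])
        coverage = np.maximum(similarity, best[:, None]).sum(axis=0)
        gains = coverage - best.sum()    # marginal gain f(S ∪ {j}) − f(S)
        gains[~available] = -np.inf      # never re-select an item
        j = int(np.argmax(gains))
        selected.append(j)
        available[j] = False
        best = np.maximum(best, similarity[:, j])
    return selected

# Hypothetical usage: pick <= 30% of a 200-item benchmark, then estimate
# the full-benchmark score from the subset responses (an unweighted mean
# here; a bias-bounded estimator may weight items differently).
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))          # stand-in item embeddings
sim = np.maximum(emb @ emb.T, 0.0)        # nonnegative sims keep f monotone
subset = greedy_subset(sim, budget=60)
correct = rng.random(200) < 0.7           # stand-in per-item model responses
estimated_score = correct[subset].mean()  # estimate of full-benchmark score
```

Because facility location is monotone and submodular, each greedy step takes the item with the largest marginal coverage gain, which is what makes the simple loop above come with a worst-case approximation guarantee rather than being a pure heuristic.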