Active Evaluation Acquisition for Efficient LLM Benchmarking

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: As large language models (LLMs) become increasingly versatile, numerous large-scale benchmarks have been developed to thoroughly assess their capabilities. These benchmarks typically consist of diverse datasets and prompts to evaluate different aspects of LLM performance. However, comprehensive evaluations on hundreds or thousands of prompts incur tremendous costs in terms of computation, money, and time. In this work, we investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy. Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples based on the outcomes of the selected ones. Consequently, we only need to acquire the actual evaluation outcomes for the selected subset. We rigorously explore various subset selection policies and introduce a novel RL-based policy that leverages the captured dependencies. Empirical results demonstrate that, compared to previous methods, our approach significantly reduces the number of evaluation prompts required while maintaining accurate performance estimates.
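The core idea of "acquire a subset, predict the rest" can be illustrated with a minimal sketch. The code below is a simplified, hypothetical illustration, not the paper's RL-based policy: it selects prompts whose outcomes vary most across previously evaluated models, acquires a new model's outcomes only on that subset, and imputes the remaining outcomes by similarity-weighted averaging over past models. All names, the selection heuristic, and the imputation rule are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's method): select an informative
# subset of benchmark prompts and impute the rest from historical evaluations.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical historical data: rows = previously evaluated models,
# columns = benchmark prompts, entries = 1 (correct) / 0 (incorrect).
history = rng.integers(0, 2, size=(50, 200)).astype(float)

def select_prompts(history, k):
    """Crude informativeness proxy: pick prompts whose outcomes vary most
    across past models (the paper instead learns a selection policy)."""
    variances = history.var(axis=0)
    return np.argsort(variances)[::-1][:k]

def impute_and_estimate(history, observed_idx, observed_outcomes):
    """Predict unobserved outcomes for a new model via similarity-weighted
    averaging over past models, then return the estimated benchmark score."""
    # Similarity = agreement with each past model on the observed prompts.
    agreement = 1.0 - np.abs(history[:, observed_idx] - observed_outcomes).mean(axis=1)
    weights = agreement / agreement.sum()
    predicted = weights @ history                 # weighted average over past models
    predicted[observed_idx] = observed_outcomes   # keep the acquired outcomes as-is
    return predicted.mean()

# Usage: evaluate a "new" model on only k prompts and estimate its full score.
new_model_truth = rng.integers(0, 2, size=200).astype(float)  # unknown in practice
subset = select_prompts(history, k=20)
estimate = impute_and_estimate(history, subset, new_model_truth[subset])
print(f"estimated score: {estimate:.3f}, true score: {new_model_truth.mean():.3f}")
```

In the paper, the variance heuristic above is replaced by a learned (RL-based) acquisition policy and the imputation step by a model of dependencies across test examples; the sketch only conveys the overall acquire-then-predict workflow.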
Lay Summary: As AI language models grow more powerful, evaluating their capabilities has become increasingly expensive and time-consuming. Current benchmarks require running thousands of test questions (called prompts) on each model, which can cost millions of dollars and slow down progress. Our research tackles the challenge of how to evaluate these models more efficiently without sacrificing accuracy. We developed a method that learns which prompts are the most informative for each model. Instead of testing every model on every prompt, our system selects a small, customized set of prompts and predicts the rest using a statistical model trained on previous evaluations. This approach is inspired by how a doctor might diagnose a patient using only the most relevant tests. We tested our method on five major evaluation benchmarks, including those used by HuggingFace and Chatbot Arena, and found that we can cut evaluation costs significantly—sometimes by over 90%—while still producing accurate assessments. Our work enables faster and cheaper evaluation of new language models, making it easier for researchers and practitioners to understand model strengths and weaknesses, monitor progress, and ensure responsible deployment of AI systems.
Primary Area: Applications->Language, Speech and Dialog
Keywords: LLM Evaluation, Subset Selection, Active Learning
Submission Number: 11240