Democratizing Evaluation with Infinity-Benchmarks: Sample-Level Heterogeneous Testing Over Arbitrary Capabilities
Keywords: foundation models, efficient evaluation, aggregation, lifelong benchmarking, heterogeneity
TL;DR: We introduce ∞-benchmarks to systematically and efficiently evaluate and rank the capabilities of foundation models, enabling truly lifelong benchmarking.
Abstract: Traditional fixed test datasets fall short in quantifying the open-ended potential of foundation models. In this work, we propose ∞-benchmarks, a new testing paradigm that combines individual evaluation datasets into a single, uniform, ever-expanding sample pool from which custom evaluations can be flexibly generated. An ∞-benchmark allows users to dynamically select a collection of sample-level evaluations that correspond to their specific capabilities of interest. By aggregating and reusing samples across various test sets, it enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias through real-world diversity. Most importantly, it frames model evaluation as a collective process of aggregation and selection of sample-level tests.
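To make the sample-pool idea above concrete, here is a minimal Python sketch, not the authors' implementation: every class, method, and tag name (Sample, InfinityBenchmark, query, the capability labels) is hypothetical. It illustrates how individual test sets could be merged into one pool of capability-tagged, sample-level tests from which a custom evaluation is just a selection.

```python
# Hedged sketch of an ever-expanding sample pool; names and fields are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Sample:
    sample_id: str
    source_dataset: str              # original test set the sample came from
    capabilities: set                # e.g. {"arithmetic", "coding", "ocr"}
    metric_type: str                 # "binary", "numeric", or "ordinal"
    results: dict = field(default_factory=dict)   # model name -> measurement (may be sparse)

class InfinityBenchmark:
    """A single, uniform pool of sample-level tests that grows as test sets are added."""
    def __init__(self):
        self.pool = []

    def add_dataset(self, samples):
        # Aggregating a new evaluation dataset simply extends the shared pool.
        self.pool.extend(samples)

    def query(self, capabilities):
        # A user-defined evaluation is a selection of samples whose tags
        # intersect the capabilities of interest.
        wanted = set(capabilities)
        return [s for s in self.pool if wanted & s.capabilities]
```

A user interested in, say, arithmetic reasoning would call `query({"arithmetic"})` and score models only on the returned samples, reusing items drawn from many original test sets.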
The shift from multi-task benchmarks to ∞-benchmarks introduces two key challenges: (1) heterogeneity and (2) incompleteness. Heterogeneity refers to aggregating diverse metrics, including binary, numeric, and ordinal data, while incompleteness describes comparing models evaluated on different subsets of the test data. To address these challenges, we explore algorithms inspired by social choice theory that aggregate sparse, unequal measurements into reliable model scores. Our aggregation algorithm ensures identifiability (asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model comparisons with relatively little data. We introduce ∞-LLMBench for language models and ∞-LMMBench for vision-language models, unifying evaluations across leaderboards and arenas in these domains and showcasing targeted querying over a wide range of capabilities. Our algorithm recovers ground-truth rankings with high Kendall τ correlation relative to standard aggregation on homogeneous metrics, even with up to 95% of measurements missing. This approach reduces evaluation cost by up to 20× with little to no compromise in performance. Overall, we present the first large-scale ∞-benchmarks for lifelong, efficient evaluation of language and vision-language models, which aggregate open-ended, heterogeneous sample-level tests and can evolve alongside the rapid development of foundation models.
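The sketch below illustrates one representative social-choice-style aggregator, a per-sample Borda count, applied to sparse, heterogeneous measurements; the paper does not state that this is its algorithm, and the toy samples, model names, and reference ranking are hypothetical. The key point it demonstrates is that only the relative order of models within each sample is used, so binary, numeric, and ordinal metrics can be pooled, and missing entries are simply skipped.

```python
# Hedged sketch: Borda-style aggregation of sparse, heterogeneous sample-level results.
# Assumes higher measurements are better for every metric; not the paper's exact method.
from collections import defaultdict
from scipy.stats import kendalltau

def borda_aggregate(samples):
    """samples: list of dicts {model: measurement}; models may be missing per sample."""
    points, counts = defaultdict(float), defaultdict(int)
    for res in samples:
        ranked = sorted(res, key=res.get)                      # worst -> best within this sample
        for rank, model in enumerate(ranked):
            points[model] += rank / max(len(ranked) - 1, 1)    # normalized Borda points
            counts[model] += 1
    # Average over only the samples on which each model was actually evaluated.
    return {m: points[m] / counts[m] for m in points}

# Toy usage with mixed metric types and missing entries (all values invented).
samples = [
    {"model_a": 1, "model_b": 0},                  # binary correctness
    {"model_b": 0.7, "model_c": 0.4},              # numeric score
    {"model_a": 2, "model_b": 5, "model_c": 4},    # ordinal rating
]
scores = borda_aggregate(samples)
reference = {"model_a": 0.2, "model_b": 0.6, "model_c": 0.4}   # hypothetical ground-truth scores
models = sorted(scores)
tau, _ = kendalltau([scores[m] for m in models], [reference[m] for m in models])
```

Ranking agreement with a reference can then be reported via Kendall τ, as in the last two lines, which is the same correlation measure the abstract uses to compare against standard aggregation on homogeneous metrics.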
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3865