Keywords: Evaluation, Benchmarking, Validity, LLMs
TL;DR: This paper addresses the lack of standardization in AI benchmark validation by proposing a set of robust best practices and releasing BenchBench, a software toolkit and leaderboard to ensure more reliable and reproducible comparisons.
Abstract: Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks.
A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., Spearman correlation).
Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing, which can lead to invalid conclusions and mistrust.
By analyzing over 40 prominent benchmarks, we show how some overlooked methodological choices can significantly influence BAT results. To address these inconsistencies, we propose a set of best practices and demonstrate their impact on robustness and validity.
To foster adoption and facilitate future research, we introduce BenchBench, a Python package and leaderboard for BAT; links are provided in the Appendix.
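To make the core idea concrete, the following is a minimal illustrative sketch of the agreement-testing setup described above: model scores from two benchmarks are aligned and compared with a Spearman correlation. The model names and scores are hypothetical, and this is not the BenchBench API.

```python
# Illustrative sketch of Benchmark Agreement Testing (BAT):
# compare model rankings from two benchmarks via Spearman correlation.
from scipy.stats import spearmanr

# Hypothetical aggregate scores for the same models on two benchmarks.
benchmark_a = {"model-1": 71.2, "model-2": 65.4, "model-3": 80.1, "model-4": 58.9}
benchmark_b = {"model-1": 68.0, "model-2": 66.5, "model-3": 77.3, "model-4": 61.2}

# Align the two score lists over the shared set of models.
models = sorted(set(benchmark_a) & set(benchmark_b))
scores_a = [benchmark_a[m] for m in models]
scores_b = [benchmark_b[m] for m in models]

# Spearman correlation of the induced rankings is one common agreement metric.
corr, p_value = spearmanr(scores_a, scores_b)
print(f"Spearman agreement: {corr:.3f} (p={p_value:.3f})")
```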
Submission Number: 245