AI Benchmarks: Interdisciplinary Issues and Policy Considerations

Published: 05 Jun 2025 · Last Modified: 15 Jul 2025
Venue: ICML 2025 Workshop TAIG (Poster)
License: CC BY 4.0
Keywords: AI benchmarks, Benchmark critique, AI evaluation, Safety evaluation, AI Regulation
Abstract: Artificial Intelligence (AI) benchmarks have emerged as essential tools for evaluating AI performance, capabilities, and risks. However, as their influence grows, concerns arise about their limitations and side effects when assessing sensitive topics such as high-impact capabilities, safety, and systemic risks. In this work, we summarise the results of an interdisciplinary meta-review of approximately 110 studies published over the last decade, identifying key shortcomings in AI benchmarking practices. These include issues of design and application (e.g., biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) as well as broader sociotechnical issues (e.g., an over-reliance on text-based, one-time evaluation logic that neglects multimodality and interaction). We also highlight systemic flaws, such as misaligned incentives, construct validity issues, unknown unknowns, and the gaming of benchmark results. We underscore how benchmarking practices are shaped by cultural, commercial, and competitive dynamics that often prioritise performance at the expense of broader societal concerns. As a result, AI benchmarking may be ill-suited to provide the assurances required by policymakers. To address these challenges, we outline key policy considerations that can help mitigate the shortcomings of current AI benchmarking practices.
Submission Number: 15