Scorecard of AI Benchmark Quality

Published: 29 Apr 2026 · Last Modified: 29 Apr 2026 · Eval Eval @ ACL 2026 Poster · CC BY 4.0
Keywords: benchmark, quality, governance, lifecycle, metaevaluation, evaluation
TL;DR: We create a scorecard that (1) identifies dimensions relevant to the quality of a benchmark and (2) provides a classification system identifying the evaluation contexts for which a benchmark is appropriate.
Abstract: Effective AI risk assessment relies on the quality of evaluations. Existing benchmarks currently vary widely in quality, for example in construct validity and annotation practices. In this work, we propose a quality scorecard for benchmarks designed to make this diversity easier to navigate. The scorecard has two main components: dimensions, each of which yields a granular score for an evaluation, and classifications, which correspond to concrete use cases ranging from research to post-deployment. By establishing a common language and objective methods, this framework aims to improve transparency and raise the baseline quality of benchmarks used across the ecosystem.
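To make the two components concrete, here is a minimal sketch of how such a scorecard might be represented, assuming hypothetical dimension names (construct_validity, annotation_quality, contamination_control) and an assumed minimum-score classification rule; the paper's actual dimensions, thresholds, and classification logic are not specified in this abstract.

```python
from dataclasses import dataclass, field

# Hypothetical dimension names, for illustration only; the paper's actual
# dimensions may differ.
DIMENSIONS = ("construct_validity", "annotation_quality", "contamination_control")

@dataclass
class Scorecard:
    """Per-dimension quality scores for one benchmark, each in [0, 1]."""
    scores: dict[str, float] = field(default_factory=dict)

    def classify(self) -> str:
        """Map granular dimension scores to a coarse evaluation context.

        Assumed rule: the weakest dimension bounds the contexts a
        benchmark is appropriate for (not the paper's stated method).
        """
        worst = min(self.scores.get(d, 0.0) for d in DIMENSIONS)
        if worst >= 0.8:
            return "post-deployment monitoring"
        if worst >= 0.5:
            return "pre-deployment testing"
        return "research use only"

card = Scorecard(scores={"construct_validity": 0.9,
                         "annotation_quality": 0.7,
                         "contamination_control": 0.6})
print(card.classify())  # -> "pre-deployment testing"
```

The min-over-dimensions rule is one plausible design choice, reflecting the idea that a benchmark is only as trustworthy as its weakest quality dimension; a weighted or per-use-case mapping would be equally compatible with the framework described above.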
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Archival
Submission Number: 59