Keywords: mechanism design, AI evaluation, benchmarking, multi-tasking
TL;DR: We present a game-theoretic model showing AI benchmarks should weight tasks by welfare alignment, improvement cost, and measurement precision, and turn this into a practical “platinum item” certification rubric.
Abstract: AI benchmarks are frequently summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable. Uniform averaging creates incentives to over-optimize for items that are trivial, socially irrelevant, or dominated by measurement noise. We model benchmarking as a multitask principal-agent game in which a benchmark designer chooses aggregation weights and a lab takes costly actions to improve its model. The optimal weights depend on normative welfare priorities, marginal costs of improvement, and measurement uncertainty. This analysis motivates \emph{platinum items}: items that (i) precisely measure (ii) welfare-aligned capabilities that are (iii) comparatively cheap to improve. We propose an operational rubric and a certification workflow, implemented via expert review and LLM-based judgments, to identify platinum items and reweight benchmark items.
Track: Short Paper
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 92