- Keywords: evaluation, dataset, benchmark
- TL;DR: The benchmarking paradigm in machine learning is incompatible with claims to performance on underspecified general tasks
- Abstract: There is a tendency across different subfields in AI to place special value on a small collection of influential benchmarks, which we term "general" benchmarks. These benchmarks operate as stand-ins or abstractions for a range of anointed common problems that are frequently framed as foundational milestones on the path towards flexible and generalizable AI systems. State-of-the-art performance on these benchmarks is widely understood as indicative of progress towards these long-term goals. In this position paper, we explore how such benchmarks are designed, constructed, and used, in order to reveal key limitations of framing them as the functionally "general", broad measures of progress they are set up to be.