AI and the Everything in the Whole Wide World Benchmark

Aug 20, 2021 (edited Aug 28, 2021) · NeurIPS 2021 Datasets and Benchmarks Track (Round 2)
  • Keywords: evaluation, dataset, benchmark
  • TL;DR: The benchmarking paradigm in machine learning is incompatible with claims to performance on underspecified general tasks
  • Abstract: There is a tendency across different subfields in AI to see value in a small collection of influential benchmarks, which we term "general" benchmarks. These benchmarks operate as stand-ins or abstractions for a range of anointed common problems that are frequently framed as foundational milestones on the path towards flexible and generalizable AI systems. State-of-the-art performance on these benchmarks is widely understood as indicative of progress towards these long-term goals. In this position paper, we explore how such benchmarks are designed, constructed, and used in order to reveal key limitations of their framing as the functionally "general" broad measures of progress they are set up to be.