Keywords: science of science, epistemic iteration, science of evals, field building, evaluation infrastructure, economics of innovation
TL;DR: Decades of hard-won knowledge about how scientific fields mature, self-correct, and fail to self-correct are directly applicable to AI evaluation, yet the field is largely ignoring them.
Abstract: A growing consensus holds that AI evaluation should become a science, yet the field has largely pursued this goal without engaging the disciplines that study how sciences form, mature, and self-correct. We draw on three overlapping traditions (the science of science, the meta-science and open science movement, and the economics of innovation) to address three questions critical to AI evaluation’s maturation: How does the AI evaluation ecosystem work as a knowledge system, and how can we measure it? How can we make AI evaluation more reliable, transparent, cumulative, and self-correcting? What structural conditions produce or suppress evaluation quality, and how do we design institutions accordingly? We diagnose AI evaluation as a field in early mobilization whose dominant benchmark paradigm shows degenerative characteristics, apply frameworks from field formation theory, the philosophy of scientific research programs, and culture change models to specify where the field stands, and propose a staged maturation agenda organized by Nosek’s culture change pyramid. We argue that building the capacity to study AI evaluation scientifically is both a marker of and a prerequisite for the field’s maturation, and that doing so produces returns well beyond tracking progress, including accelerating methodological innovation, strengthening the evidence base for AI governance, and making the field’s knowledge cumulative rather than ad hoc.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Provocation
Archival Status: Non-archival
Submission Number: 61