TL;DR: Argues that centaur evaluations, in which humans and AI systems solve tasks together, are feasible and direct technological progress toward human augmentation, helping to balance power and promote societal welfare, among other benefits.
Abstract: Benchmarks and evaluations are central to machine learning methodology and direct research in the field. Current evaluations commonly test systems in the absence of humans. This position paper argues that the machine learning community should increasingly use _centaur evaluations_, in which humans and AI jointly solve tasks. Centaur evaluations refocus machine learning development toward human augmentation instead of human replacement, they allow for direct evaluation of human-centered desiderata such as interpretability and helpfulness, and they can be more challenging and realistic than existing evaluations. By shifting the focus from _automation_ toward _collaboration_ between humans and AI, centaur evaluations can drive progress toward more effective and human-augmenting machine learning systems.
Lay Summary: To decide which Artificial Intelligence system (e.g., ChatGPT, Claude, or Gemini) to use for a task, we need to know which ones are good at the task at hand. Currently, many evaluations test models on how well they perform human activities, such as solving mathematical problems or summarizing text, without any humans involved. We argue that we need to include humans in the evaluation, e.g., by letting many humans solve a writing or coding task together with different Artificial Intelligence models and comparing the outcomes.
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: evaluation, benchmarks, human augmentation, human replacement, Turing trap, centaurs
Submission Number: 203