TL;DR: Centaur evaluations, in which humans and systems together solve tasks, are feasible and desirable.
Abstract: Benchmarks and evaluations are central to machine learning methodology and direct research in the field. Current evaluations commonly test systems in the absence of humans. This position paper argues that the machine learning community should increasingly use _centaur evaluations_, in which humans and AI jointly solve tasks. Centaur evaluations refocus machine learning development toward human augmentation instead of human replacement; they allow for direct evaluation of human-centered desiderata, such as interpretability and helpfulness; and they can be more challenging and realistic than existing evaluations. By shifting the focus from _automation_ toward _collaboration_ between humans and AI, centaur evaluations can drive progress toward more effective and human-augmenting machine learning systems.
Lay Summary: To decide which Artificial Intelligence system (e.g., ChatGPT, Claude, or Gemini) to use for a task, we need to know which ones are good at the task at hand. Currently, many evaluations test how AI systems perform human activities, such as solving mathematical problems or summarizing text, without interacting with humans. We argue that we need to include humans in the evaluation, e.g., by letting many humans solve a writing or coding task together with different AI systems and comparing the outcomes.
Verify Author Names: My co-authors have confirmed that their names are spelled correctly both on OpenReview and in the camera-ready PDF. (If needed, please update ‘Preferred Name’ in OpenReview to match the PDF.)
No Additional Revisions: I understand that after the May 29 deadline, the camera-ready submission cannot be revised before the conference. I have verified with all authors that they approve of this version.
Pdf Appendices: My camera-ready PDF file contains both the main text (not exceeding the page limits) and all appendices that I wish to include. I understand that any other supplementary material (e.g., separate files previously uploaded to OpenReview) will not be visible in the PMLR proceedings.
Latest Style File: I have compiled the camera ready paper with the latest ICML2025 style files <https://media.icml.cc/Conferences/ICML2025/Styles/icml2025.zip> and the compiled PDF includes an unnumbered Impact Statement section.
Paper Verification Code: ZTU2O
Permissions Form: pdf
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: evaluation, benchmarks, human augmentation, human replacement, Turing trap, centaurs
Submission Number: 203