Keywords: Benchmark, Trustworthy Evaluation, Generative AI, Data Integrity, Regulation, Fairness
TL;DR: Current AI benchmarks suffer from systematic flaws such as data leakage and selective reporting. We propose PeerBench, a community-run evaluation platform that combines sealed, continuously refreshed tests with reputation-weighted scoring to restore trust in AI performance claims.
Abstract: The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal serious vulnerabilities. Issues such as data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from rating agencies such as Moody's.
In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that a laissez-faire approach is untenable. For true and sustainable AI advancement, we call for a paradigm shift to a unified, live, and quality-controlled benchmarking framework—robust by construction rather than reliant on courtesy or goodwill. Accordingly, we dissect the systemic flaws undermining today’s evaluation ecosystem and distill the essential requirements for next-generation assessments.
To concretize this position, we introduce the idea of PeerBench, a community-governed, proctored evaluation blueprint that seeks to improve security and credibility through sealed execution, item banking with rolling renewal, and delayed transparency. PeerBench is presented as a complementary, certificate-grade layer alongside open benchmarks, not a replacement. We discuss trade-offs and limits and call for further research on mechanism design, governance, and reliability guarantees. Our goal is to lay the groundwork for evaluations that restore integrity and deliver genuinely trustworthy measures of AI progress.
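To make the blueprint more tangible, the sketch below models one possible item lifecycle behind sealed execution, item banking with rolling renewal, and delayed transparency: items stay sealed until activated, are retired after use, and are disclosed only once a fixed embargo has elapsed. The class and method names and the 90-day delay are illustrative assumptions for this sketch, not specifications from PeerBench.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class ItemState(Enum):
    SEALED = "sealed"        # held in the item bank, never exposed
    LIVE = "live"            # currently used in proctored evaluation runs
    RETIRED = "retired"      # withdrawn from scoring, awaiting disclosure
    PUBLISHED = "published"  # disclosed after the transparency delay elapses


@dataclass
class BenchmarkItem:
    item_id: str
    payload: str                        # the test question itself
    state: ItemState = ItemState.SEALED
    retired_at: Optional[datetime] = None


class ItemBank:
    """Rolling item bank: items move sealed -> live -> retired -> published."""

    def __init__(self, transparency_delay: timedelta = timedelta(days=90)):
        self.transparency_delay = transparency_delay
        self.items: dict[str, BenchmarkItem] = {}

    def add_sealed(self, item: BenchmarkItem) -> None:
        """Rolling renewal: fresh items enter the bank in the sealed state."""
        self.items[item.item_id] = item

    def activate(self, item_id: str) -> None:
        """Expose an item only inside a proctored evaluation run."""
        self.items[item_id].state = ItemState.LIVE

    def retire(self, item_id: str, now: datetime) -> None:
        """Withdraw an item from scoring immediately after use."""
        item = self.items[item_id]
        item.state = ItemState.RETIRED
        item.retired_at = now

    def publish_due_items(self, now: datetime) -> list[str]:
        """Delayed transparency: disclose retired items once the delay has passed."""
        published = []
        for item in self.items.values():
            if item.state is ItemState.RETIRED and item.retired_at is not None:
                if now - item.retired_at >= self.transparency_delay:
                    item.state = ItemState.PUBLISHED
                    published.append(item.item_id)
        return published
```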
Lay Summary: When companies claim their AI systems achieve "superhuman performance" or beat competitors on leaderboards, how can we trust these claims? This paper argues that the current system for testing AI is fundamentally broken, much as if students could see exam questions beforehand or report only their best test scores.
The main problems are: (1) AI systems often accidentally or deliberately train on the exact questions they will be tested on, artificially inflating their scores, much like studying with the answer key in hand; (2) Companies cherry-pick which tests to report, creating an illusion of superior performance; (3) Many widely used tests are years old and no longer meaningful; (4) There is no oversight to ensure fair testing conditions.
We argue that we should treat AI evaluation as seriously as, if not more seriously than, standardized human exams such as the SAT or bar exam, which follow rigorous procedures to ensure fairness. We propose PeerBench, a community-run platform where: (1) Test questions remain secret until used, then are immediately retired; (2) Multiple independent reviewers verify test quality; (3) Contributors and reviewers build reputations over time, with higher-quality contributions weighted more heavily; (4) All evaluations happen in controlled, monitored environments; (5) Fresh tests are continuously added to prevent memorization.
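As one illustration of how reputation-weighted review could work in such a platform, the sketch below aggregates reviewers' quality ratings for a candidate test item, weighting each rating by the reviewer's reputation, and then nudges reputations toward agreement with the consensus. The function names, the [0, 1] rating scale, and the update rule are assumptions made for illustration, not PeerBench's specified mechanism.

```python
def reputation_weighted_score(ratings: dict[str, float],
                              reputation: dict[str, float]) -> float:
    """Aggregate per-reviewer ratings, weighting each by reviewer reputation."""
    total_weight = sum(reputation.get(r, 0.0) for r in ratings)
    if total_weight == 0:
        raise ValueError("No reviewer with positive reputation rated this item.")
    return sum(rating * reputation.get(r, 0.0)
               for r, rating in ratings.items()) / total_weight


def update_reputation(reputation: dict[str, float],
                      ratings: dict[str, float],
                      consensus: float,
                      learning_rate: float = 0.1) -> dict[str, float]:
    """Nudge each reviewer's reputation toward agreement with the consensus."""
    updated = dict(reputation)
    for reviewer, rating in ratings.items():
        agreement = 1.0 - abs(rating - consensus)   # ratings assumed in [0, 1]
        prior = updated.get(reviewer, 0.5)
        updated[reviewer] = (1 - learning_rate) * prior + learning_rate * agreement
    return updated


# Example: three reviewers rate a candidate test item's quality in [0, 1].
ratings = {"alice": 0.9, "bob": 0.8, "carol": 0.2}
reputation = {"alice": 0.9, "bob": 0.7, "carol": 0.3}
consensus = reputation_weighted_score(ratings, reputation)   # ~0.75
reputation = update_reputation(reputation, ratings, consensus)
```

In this toy run, the outlier reviewer's low rating counts less because of their lower reputation, and their reputation drifts down further as their rating diverges from the consensus; a real mechanism would need safeguards against entrenchment and collusion, which the paper leaves to future work on mechanism design.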
This matters because as AI increasingly influences critical decisions in healthcare, finance, and other domains, we need trustworthy ways to measure actual capabilities versus marketing hype. Just as financial markets rely on credible rating agencies, the AI industry needs reliable benchmarking to distinguish genuine progress from exaggerated claims. The proposed system aims to restore scientific rigor to AI evaluation while maintaining the collaborative spirit of the research community.
Submission Number: 334