PaperBench: Evaluating AI’s Ability to Replicate AI Research

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · License: CC BY 4.0
TL;DR: PaperBench evaluates AI agents’ ability to replicate ML research by replicating results from ICML 2024 papers.
Abstract: We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
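For intuition about the hierarchical rubrics described above, here is a minimal, hypothetical Python sketch of how such a rubric could be represented and scored. The `RubricNode` class, weights, and example requirements are illustrative assumptions for exposition, not PaperBench's actual schema; the real rubrics are co-developed with paper authors and graded by an LLM-based judge.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One requirement in a hierarchical replication rubric.

    Leaf nodes hold a binary judgment (satisfied or not); internal nodes
    aggregate their children's scores as a weighted average.
    """
    description: str
    weight: float = 1.0
    satisfied: Optional[bool] = None          # set by the judge on leaf nodes
    children: List["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        if not self.children:                 # leaf: binary grade
            return 1.0 if self.satisfied else 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight


# Tiny illustrative rubric (contents are made up, not taken from PaperBench)
rubric = RubricNode("Replicate Paper X", children=[
    RubricNode("Code development", weight=2, children=[
        RubricNode("Implements the proposed method", satisfied=True),
        RubricNode("Implements the baseline", satisfied=False),
    ]),
    RubricNode("Execution", weight=1, children=[
        RubricNode("Training script runs end to end", satisfied=True),
    ]),
])

print(f"Replication score: {rubric.score():.1%}")  # 66.7% for this toy example
```

Under this toy scheme, grading many fine-grained leaf requirements and averaging them up the tree is what turns an otherwise subjective "did the agent replicate the paper?" question into a single objective replication score.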
Lay Summary: As AI systems grow more capable, we need to understand whether they can independently conduct machine learning research, a capability that could accelerate scientific progress but also raise safety concerns. Our work introduces PaperBench, a benchmark that challenges AI systems to replicate 20 cutting-edge machine learning papers completely from scratch, requiring them to understand the research, write code, and successfully run experiments. We developed detailed assessment rubrics with the original paper authors to break down each replication task into hundreds of individually gradable components, turning a complex, subjective evaluation into an objective assessment that can be automatically graded by other AI systems. When testing several advanced AI systems, we found that even the best-performing AI agent achieved a replication score of only 27%, while human machine learning experts scored 41% under similar circumstances. We release PaperBench as a tool for measuring how well AI systems can autonomously perform complex machine learning research, helping track progress as these capabilities advance and informing important decisions about AI development and governance.
Link To Code: https://github.com/openai/preparedness/tree/main/project/paperbench
Primary Area: General Machine Learning->Evaluation
Keywords: benchmark, evals, evaluations, dataset, tasks, engineering, agents, scaffold, coding, mle, research, r&d
Submission Number: 4863