CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Published: 31 Dec 2024. Last Modified: 31 Dec 2024. Accepted by TMLR. License: CC BY 4.0
Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging and, more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench span three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system that measures agent accuracy in a fast, parallelizable way, saving days of evaluation time per run compared to a sequential implementation. We evaluated two baseline agents, the general-purpose AutoGPT and a task-specific agent called CORE-Agent, each with two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 19% on the hardest level of tasks, showing vast scope for improvement in automating routine scientific tasks. Agents that can reproduce existing work are a necessary step toward agents that can conduct novel research, and they could also verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
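The abstract notes that the evaluation system runs tasks in a parallelizable way rather than sequentially. The sketch below illustrates that idea only, assuming a hypothetical task interface; the function names and return values are placeholders, not the actual API of the CORE-Bench harness (see the linked repository for the real implementation).

# Minimal sketch (not the actual CORE-Bench harness): evaluating independent
# benchmark tasks concurrently instead of one at a time. All names and the
# task interface below are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_agent_on_task(task_id: str) -> dict:
    # Placeholder: a real harness would launch the agent in an isolated
    # environment for this task and collect its reported answers.
    return {"task_id": task_id, "answers": {}}

def score_task(reported: dict) -> bool:
    # Placeholder: compare the agent's reported answers to the task's
    # ground-truth values and return whether they match.
    return False

def evaluate(task_ids: list[str], max_workers: int = 8) -> float:
    # Tasks are independent, so running them in parallel reduces wall-clock
    # time roughly in proportion to the number of workers.
    correct = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_agent_on_task, t) for t in task_ids]
        for fut in as_completed(futures):
            correct += score_task(fut.result())
    return correct / len(task_ids)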
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: For the camera-ready version, we updated the framing and scope of the paper to show how our benchmark assesses whether AI agents can verify computational reproducibility, and we fixed a bug in the results:
- Updated the research question ("In this paper, we ask: Can AI Agents Enable Verification of the Computational Reproducibility of Published Scientific Research?") and added a paragraph clarifying how we focus on verifying code reproducibility.
- Added a paragraph at the end of Section 2.1 explaining how the benchmark focuses on verifying reproducibility, and why that means we do not include irreproducible capsules.
- Updated result accuracies throughout the paper to reflect a bug we fixed in the evaluation harness. The corrected results are about 1.5% lower than those reported in the original version.
We also previously made the following changes to address reviewer feedback:
- Updated the header of Section 4.1, since AutoGPT does slightly better at the Medium than the Easy difficulty level.
- Updated wording about how the code being from public repositories does not fully mitigate contamination concerns.
- Added a qualitative analysis of agent failures on CORE-Bench-Hard.
- Added a table to the appendix indicating how long each task takes to complete for each agent-model pair.
- Added more examples of task questions to the paper appendix.
- Added the proportion of visual and written questions by difficulty level to the appendix.
- Added the proportion of R/Python capsules by discipline to the appendix.
- Updated the supplementary code to provide a cheap baseline for evaluating our harness.
Video: https://www.youtube.com/watch?v=Nrml8ta3PFc
Code: https://github.com/siegelz/core-bench
Supplementary Material: zip
Assigned Action Editor: ~Yonatan_Bisk1
Submission Number: 3380