everyone">EveryoneCC BY 4.0
Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To measure this, we present SUPER, the first benchmark for setting up and executing tasks from research repositories. To capture realistic tasks facing researchers, we derive tasks from 42 research repositories from Machine Learning (ML) and Natural Language Processing (NLP) research papers. Each task is annotated with an expert solution, which we use to create a benchmark of 152 scenarios. Each scenario captures a specific challenge associated with reproducing experiments (e.g., configuring a trainer). We develop two evaluation measures that assess both task success and progress towards the task. We show that state-of-the-art approaches struggle to solve these scenarios, with the best model (GPT-4o) solving only 38.3% of them. This illustrates the challenges that remain in this important task and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.