REPA: Reproducibility Evaluation via an Autonomous Pipeline Architecture

Published: 30 May 2026, Last Modified: 30 May 2026, ICML2026-AI4Science Poster, CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
TL;DR: A reproducibility pipeline that uses LLMs to generate, execute, and evaluate ML experiments from paper descriptions; it succeeds for well-specified experiments with standard architectures and moderate dataset sizes.
Abstract: We present REPA, a framework that autonomously reproduces text classification experiments from paper descriptions, without access to reference code or repositories. Unlike prior AI Scientist systems, REPA targets the reproduction problem directly by decomposing it into four stages: protocol extraction, input preparation, experiment generation, and evaluation. On a favorable set of ten well-documented text classification papers, REPA replicates eight studies when instantiated with a GPT-4o backend and five with Qwen3-Coder-30B. Compared with a replication rate of zero via direct prompting, these results establish a performance ceiling for current LLM automation as of mid-2026 and demonstrate the importance of template scaffolding for successful scientific reproduction.
Keywords: reproducibility, automated, ai scientist, llms
Submission Number: 32