Track: long paper (up to 9 pages)
Keywords: Code Generation, Automatic Benchmark Generation, LLMs, Benchmarking
TL;DR: We propose SetUpAgent to automatically generate SWE-Bench-like code generation benchmarks from a list of GitHub repositories and demonstrate its effectiveness by generating two new benchmarks.
Abstract: Code agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench, a benchmark that challenges code agents to generate patches addressing GitHub issues given the full repository as context, and then evaluates their correctness by executing the human-written test suite extracted from the repository after the issue's resolution. However, constructing benchmarks like SWE-Bench requires substantial manual effort to set up historically accurate execution environments for testing. Crucially, this severely limits the number of considered repositories, e.g., just 12 for SWE-Bench. A danger of such a selection process is a distributional mismatch, i.e., the measured performance may not be representative of real-world scenarios, potentially misguiding development efforts. In this work, we address this challenge and introduce SetUpAgent, a fully automated system capable of historically accurate dependency setup, test execution, and result parsing. Using SetUpAgent, we generate two new datasets: (i) an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) a benchmark centered on applications rather than libraries. Comparing these datasets to SWE-Bench with respect to their characteristics and code agent performance, we find significant distributional differences, including lower issue description quality and detail level, higher fix complexity, and, most importantly, up to 40% lower agent success rates.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 6