SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

ACL ARR 2026 March Submission 1722 Authors

17 Mar 2026 (modified: 07 Jun 2026), ACL ARR 2026 March Submission, CC BY 4.0
Keywords: software engineering benchmarks, code agents
Abstract: Real-world software engineering requires developers to interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. Yet current benchmarks for AI coding agents evaluate only isolated, single-issue tasks such as fixing one bug or adding a small feature. We introduce SWE-EVO, a benchmark that closes this gap by targeting long-horizon software evolution. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments with two agent frameworks and 18 state-of-the-art models reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress on these complex, long-horizon tasks.
Paper Type: Long
Research Area: NLP and Code Models
Research Area Keywords: AI/LLM Agents, Resources and Evaluation, Code Models
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 1722