SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

ICLR 2026 Conference Submission 21730 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: agent, swe, llm, environment
TL;DR: We present SWE-Bench Pro, a benchmark for evaluating agent capability in long-horizon software engineering tasks.
Abstract: We present SWE-Bench Pro, a comprehensive benchmark designed to evaluate software engineering capabilities through complex, realistic programming challenges. The benchmark extends beyond traditional algorithmic problems to cover the full spectrum of professional software development tasks. The dataset comprises 1,865 problems sourced from 41 active software engineering repositories, spanning 123 unique programming languages and a variety of application domains. The benchmark is split into public and private components: problems from 11 repositories are publicly accessible, while private evaluation sets drawn from 12 repositories cover 4 distinct problem categories. SWE-Bench Pro addresses limitations of existing evaluation frameworks by incorporating problems that reflect real-world software engineering scenarios, including substantial codebases, complex enterprise applications, and multi-file projects requiring sophisticated reasoning and code modification skills. Problems range from early-stage startup codebases to enterprise-level applications; the private commercial set remains inaccessible to preserve evaluation integrity, while the public set offers representative problems drawn from professional development settings. Our evaluation methodology applies diverse coding agents and models under controlled conditions, ensuring robust performance assessment across multiple programming paradigms. Results show significant performance variation across problem categories, with traditional algorithmic challenges yielding notably higher success rates than complex, multi-file engineering tasks. The benchmark reveals substantial gaps in current capabilities for handling real-world software engineering scenarios, particularly in areas requiring deep contextual understanding, cross-file reasoning, and integration with existing large-scale systems. This work contributes a more comprehensive and realistic evaluation framework for assessing software engineering capabilities, providing insight into current limitations and establishing a foundation for future work on automated software engineering tools and methodologies.
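
For readers unfamiliar with the execution-based evaluation style the abstract refers to, the sketch below shows how a SWE-Bench-style harness could score an agent's patch: check out the repository at a fixed commit, apply the agent's diff, and run the task's test command. This is not the paper's released harness; the field names (repo_url, base_commit, model_patch, test_command) and the overall flow are illustrative assumptions.

```python
"""Minimal sketch of a SWE-Bench-style evaluation step (illustrative only).

Field names and flow are assumptions, not the official SWE-Bench Pro harness.
"""
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskInstance:
    repo_url: str      # repository the issue was filed against (assumed field)
    base_commit: str   # commit to check out before applying the patch (assumed field)
    model_patch: str   # unified diff produced by the agent under evaluation
    test_command: str  # command whose exit code decides pass/fail (assumed field)


def evaluate(task: TaskInstance, workdir: Path) -> bool:
    """Clone the repo, apply the agent's patch, and run the tests.

    Returns True only if the patch applies cleanly and the test command
    exits with status 0 afterwards.
    """
    repo_dir = workdir / "repo"
    subprocess.run(["git", "clone", task.repo_url, str(repo_dir)], check=True)
    subprocess.run(["git", "checkout", task.base_commit], cwd=repo_dir, check=True)

    patch_file = workdir / "model.patch"
    patch_file.write_text(task.model_patch)
    applied = subprocess.run(["git", "apply", str(patch_file.resolve())], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # patch does not apply cleanly -> task counted as failed

    result = subprocess.run(task.test_command, shell=True, cwd=repo_dir)
    return result.returncode == 0
```

In practice such harnesses typically run each instance in an isolated container and distinguish fail-to-pass tests (which must newly pass) from pass-to-pass tests (which must not regress); the single test command above is a simplification.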
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21730