Keywords: Evaluation, Coding Agents
Abstract: Real software development is cumulative: changes persist, tests accumulate, and every new line of code becomes a dependency for future work. Yet most coding-agent benchmarks evaluate isolated tasks from clean repository states, missing the temporal and stateful pressures that arise when agents are used repeatedly on the same codebase. To capture this dynamic of software engineering, we introduce SWE-STEP, a stateful and temporal evaluation framework. Instead of treating tasks as independent, SWE-STEP evaluates agents across continuous, temporally ordered pull requests (PRs) to measure not just immediate functional correctness but also long-term repository health. We instantiate this framework with SWE-STEP-Full, which comprises 168 tasks and 963 PRs across six Python repositories, and test agents in both sequential workflows (conversational coding) and bundled workflows (all requirements provided upfront). Our experiments reveal that stateless, isolated evaluations (e.g., SWE-Bench) overestimate agent capabilities by up to 20 percentage points by masking the spillover effects of past mistakes. Furthermore, we find that even when agents pass functional tests, they steadily degrade repository health by introducing higher cognitive complexity and more technical debt than human developers.
Submission Number: 112