Keywords: Language models, Natural language processing, Software engineering
Abstract: Agents and Language Models (LMs) have demonstrated significant advances in software engineering, particularly in issue resolution. Current benchmarks can qualitatively assess the correctness of generated patches; however, they lack mechanisms for quantitatively evaluating the trajectory, which is important for revealing points of improvement. To better understand the working processes of issue-resolving agents, we propose SWE-eval, a trajectory-augmented evaluation framework. SWE-eval additionally assesses a coding agent's reasoning trajectory along three dimensions: (1) Efficiency, measured by resource consumption; (2) Logical Consistency, where Intra-turns measures logical consistency within a single turn and Inter-turns measures it across multiple conversation turns; and (3) Tool Utilization, for which we design an Info-gain metric that assesses how much new information each tool call contributes to solving the problem. Our experiments on three agents and nine LMs demonstrate that SWE-eval effectively reveals the factors underlying agent performance and can guide the development of more effective agents. First, our evaluations show that improving trajectory-aware metrics is crucial for raising the % Resolved. Second, we trace divergent agent behaviors to shallow exploration, missing backtracking, and loop entrapment; we also show that fine-tuning on agents risks overfitting, whereas scaling LMs improves trajectories. Third, LLM-based evaluations align closely with expert judgments and remain consistently stable, serving as reliable proxies.
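To illustrate the kind of trajectory-level signal the abstract describes, the sketch below computes an information-gain-style score for each tool call, measured as the fraction of observation tokens not yet seen in the agent's prior context. The `Turn` dataclass, the token-overlap novelty measure, and the `info_gain` function are hypothetical assumptions for illustration only, not SWE-eval's actual Info-gain definition.

```python
# Hypothetical sketch of an information-gain-style metric for tool calls.
# All names (Turn, info_gain) are illustrative, not the paper's definitions.
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Turn:
    """One agent turn: the tool invoked and the observation it returned."""
    tool: str
    observation: str


def _tokens(text: str) -> Set[str]:
    """Crude whitespace tokenization; a real implementation might use an LM tokenizer."""
    return set(text.lower().split())


def info_gain(trajectory: List[Turn]) -> List[float]:
    """Score each turn by the fraction of observation tokens not yet seen before it.

    A score near 1.0 means the tool surfaced mostly new information; a score
    near 0.0 means the call largely repeated what the agent already had.
    """
    seen: Set[str] = set()
    gains: List[float] = []
    for turn in trajectory:
        obs = _tokens(turn.observation)
        if not obs:
            gains.append(0.0)
            continue
        new_tokens = obs - seen
        gains.append(len(new_tokens) / len(obs))
        seen |= obs
    return gains


if __name__ == "__main__":
    traj = [
        Turn("search", "def resolve_issue(repo): ..."),
        Turn("open_file", "def resolve_issue(repo): ..."),  # re-reads known content
        Turn("run_tests", "FAILED test_patch.py::test_edge_case - AssertionError"),
    ]
    print(info_gain(traj))  # high gain, low gain, high gain
```

Aggregating such per-turn scores over a trajectory would give one plausible way to compare how productively different agents use their tools, alongside efficiency and logical-consistency measures.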
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 6642