Keywords: Language models, Natural language processing, Software engineering
Abstract: Agents and Language Models (LMs) have demonstrated significant advances in software engineering, particularly in issue resolution. Current benchmarks can qualitatively assess the correctness of generated patches; however, they lack mechanisms for quantitatively evaluating the trajectory, which is important for revealing points of improvement. To better understand the working processes of issue-resolving agents, we propose SWE-eval, a trajectory-augmented evaluation framework. SWE-eval additionally assesses a coding agent's reasoning trajectory along three dimensions: (1) Efficiency, measured by resource consumption; (2) Logical Consistency, where Intra-turns measures logical consistency within a single turn and Inter-turns measures it across multiple conversation turns; and (3) Tool Utilization, for which we design an Info-gain metric that assesses how much new information each tool call contributes to solving the problem. Our experiments on three agents and nine LMs demonstrate that SWE-eval effectively reveals the factors underlying agent performance and can guide the development of more effective agents. First, our evaluations show that improving trajectory-aware metrics is crucial for raising the % Resolved. Second, we trace divergent agent behaviors to shallow exploration, missing backtracking, and loop entrapment; we also show that fine-tuning on agents risks overfitting, whereas scaling LMs improves trajectories. Third, LLM-based evaluations align closely with expert judgments and remain consistently stable, serving as reliable proxies.
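To illustrate the kind of trajectory-level signal the abstract describes, the sketch below computes an information-gain-style score for each tool call, measured as the fraction of observation tokens not yet seen in the agent's prior context. The `Turn` dataclass, the token-overlap novelty measure, and the `info_gain` function are hypothetical assumptions for illustration only, not SWE-eval's actual Info-gain definition.

```python
# Hypothetical sketch of an information-gain-style metric for tool calls.
# All names (Turn, info_gain) are illustrative, not the paper's definitions.
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Turn:
    """One agent turn: the tool invoked and the observation it returned."""
    tool: str
    observation: str


def _tokens(text: str) -> Set[str]:
    """Crude whitespace tokenization; a real implementation might use an LM tokenizer."""
    return set(text.lower().split())


def info_gain(trajectory: List[Turn]) -> List[float]:
    """Score each turn by the fraction of observation tokens not yet seen before it.

    A score near 1.0 means the tool surfaced mostly new information; a score
    near 0.0 means the call largely repeated what the agent already had.
    """
    seen: Set[str] = set()
    gains: List[float] = []
    for turn in trajectory:
        obs = _tokens(turn.observation)
        if not obs:
            gains.append(0.0)
            continue
        new_tokens = obs - seen
        gains.append(len(new_tokens) / len(obs))
        seen |= obs
    return gains


if __name__ == "__main__":
    traj = [
        Turn("search", "def resolve_issue(repo): ..."),
        Turn("open_file", "def resolve_issue(repo): ..."),  # re-reads known content
        Turn("run_tests", "FAILED test_patch.py::test_edge_case - AssertionError"),
    ]
    print(info_gain(traj))  # high gain, low gain, high gain
```

Aggregating such per-turn scores over a trajectory would give one plausible way to compare how productively different agents use their tools, alongside efficiency and logical-consistency measures.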
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 6642