Keywords: LLMs, agents, agentic systems, OpenHands, SWE-bench, LLM-as-a-judge
Abstract: Despite major improvements in LLM coding agents, their performance on complex software engineering tasks remains limited: leading models solve only about half of the tasks in benchmarks such as SWE-bench. This gap highlights the need to systematically understand why coding agents fail.
We comprehensively analyze coding agent failure patterns across 18 state-of-the-art open- and closed-weight models. Through meticulous examination of 3,908 execution trajectories, we identify three distinct failure patterns: (1) \fa{}, where agents fail to interact with the environment; (2) \oo{}, where agents issue interdependent actions simultaneously rather than sequentially; and (3) \ft{}, where agents prematurely assume task completion.
Using these failure patterns, we introduce \judge{}, a test-time steering mechanism that leverages an LLM-as-a-judge framework to evaluate trajectories. \judge{} shows a strong monotonic correlation with expert annotations and can effectively identify problematic patterns in agent behavior. When applied to select optimal trajectories from multiple runs, \judge{} significantly improves performance, increasing o1-low from 21\% to 31\% on SWE-bench Verified, outperforming the more expensive o1-high model (29\%) at 57\% of the cost.
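A minimal sketch of the best-of-N trajectory selection described above: score each candidate run with a judge and keep the highest-scoring one. The `judge_score` interface and the toy rubric proxies below are hypothetical placeholders, not the paper's \judge{} implementation, which prompts an LLM with the trajectory to assess it against the failure patterns.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    run_id: int
    steps: List[str]  # serialized agent actions/observations

def select_best_trajectory(
    trajectories: List[Trajectory],
    judge_score: Callable[[Trajectory], float],
) -> Trajectory:
    """Score every candidate run with the judge and return the highest-scoring one."""
    return max(trajectories, key=judge_score)

# Toy judge: crude textual proxies for a failure-pattern rubric.
# An LLM-as-a-judge setup would instead prompt a model with the full trajectory.
def toy_judge(traj: Trajectory) -> float:
    interacted = any("execute" in step for step in traj.steps)  # did the agent use the environment?
    verified = any("run tests" in step for step in traj.steps)  # did it check its work before stopping?
    return float(interacted) + float(verified)

if __name__ == "__main__":
    runs = [
        Trajectory(0, ["read issue", "submit patch"]),
        Trajectory(1, ["read issue", "execute repro script", "edit file",
                       "run tests", "submit patch"]),
    ]
    best = select_best_trajectory(runs, toy_judge)
    print(f"Selected run {best.run_id}")  # -> Selected run 1
```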
We open-source our comprehensive dataset of trajectories to facilitate further research on improving coding agent capabilities.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14695