Keywords: LLMs, agents, agentic systems, OpenHands, SWE-bench, LLM-as-a-judge
Abstract: Despite major improvements in LLM coding agents, their performance on complex software engineering tasks remains limited: leading models solve only about half of the tasks in benchmarks such as SWE-bench. This gap highlights the need to systematically understand why coding agents fail.
We comprehensively analyze coding agent failure patterns across 18 state-of-the-art open- and closed-weight models. Through meticulous examination of 3,908 execution trajectories, we identify three distinct failure patterns: (1) \fa{}, where agents fail to interact with the environment; (2) \oo{}, where agents issue interdependent actions simultaneously rather than sequentially; and (3) \ft{}, where agents prematurely assume task completion.
Using these failure patterns, we introduce \judge{}, a test-time steering mechanism that leverages an LLM-as-a-judge framework to evaluate trajectories. \judge{} shows a strong monotonic correlation with expert annotations and can effectively identify problematic patterns in agent behavior. When applied to select optimal trajectories from multiple runs, \judge{} significantly improves performance, increasing o1-low from 21\% to 31\% on SWE-bench Verified, outperforming the more expensive o1-high model (29\%) at 57\% of the cost.
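A minimal sketch of the best-of-N trajectory selection described above: score each candidate run with a judge and keep the highest-scoring one. The `judge_score` interface and the toy rubric proxies below are hypothetical placeholders, not the paper's \judge{} implementation, which prompts an LLM with the trajectory to assess it against the failure patterns.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    run_id: int
    steps: List[str]  # serialized agent actions/observations

def select_best_trajectory(
    trajectories: List[Trajectory],
    judge_score: Callable[[Trajectory], float],
) -> Trajectory:
    """Score every candidate run with the judge and return the highest-scoring one."""
    return max(trajectories, key=judge_score)

# Toy judge: crude textual proxies for a failure-pattern rubric.
# An LLM-as-a-judge setup would instead prompt a model with the full trajectory.
def toy_judge(traj: Trajectory) -> float:
    interacted = any("execute" in step for step in traj.steps)  # did the agent use the environment?
    verified = any("run tests" in step for step in traj.steps)  # did it check its work before stopping?
    return float(interacted) + float(verified)

if __name__ == "__main__":
    runs = [
        Trajectory(0, ["read issue", "submit patch"]),
        Trajectory(1, ["read issue", "execute repro script", "edit file",
                       "run tests", "submit patch"]),
    ]
    best = select_best_trajectory(runs, toy_judge)
    print(f"Selected run {best.run_id}")  # -> Selected run 1
```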
We open-source our comprehensive dataset of trajectories to facilitate further research on improving coding agent capabilities.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14695