Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We apply test-time search with learned critics to LLM agents, even in environments where a state cannot be forked to explore multiple actions from it.
Abstract: Large language models (LLMs) have recently achieved remarkable results in complex multi-step tasks, such as mathematical reasoning and agentic software engineering. However, they often struggle to maintain consistent performance across multiple solution attempts. One effective approach to narrowing the gap between average-case and best-case performance is guided test-time search, which explores multiple solution paths to identify the most promising one. Unfortunately, effective search techniques (e.g., MCTS) are often unsuitable for *non-serializable* RL environments, such as Docker containers, where intermediate environment states cannot be easily saved and restored. We investigate two complementary search strategies applicable to such environments: 1-step lookahead and trajectory selection, both guided by a learned action-value function estimator. On the SWE-bench Verified benchmark, a key testbed for agentic software engineering, we find that these methods double the average success rate of a fine-tuned Qwen-72B model, achieving $40.8$\%, a new state of the art for open-weights models. Additionally, we show that these techniques transfer to more advanced closed models, yielding similar improvements with GPT-4o.
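To make the two strategies concrete, below is a minimal Python sketch of how critic-guided search can be run when the environment cannot be forked. The names used here (`propose_actions`, `q_value`, `env.step`, `run_episode`) are illustrative placeholders, not the paper's actual API: the key idea is that the learned action-value estimator scores candidate actions before one of them is irreversibly executed (1-step lookahead), and scores complete rollouts so the best of several full solutions can be kept (trajectory selection).

```python
# Hypothetical sketch of critic-guided search in a non-serializable environment.
# All function and object names are assumptions for illustration only.

from typing import Callable, List, Tuple


def one_step_lookahead(
    env,                                                # non-serializable environment (e.g. a Docker container)
    propose_actions: Callable[[str, int], List[str]],   # policy: (state, k) -> k candidate actions
    q_value: Callable[[str, str], float],               # learned critic: (state, action) -> estimated value
    k: int = 8,
    max_steps: int = 40,
) -> Tuple[str, float]:
    """Greedy 1-step lookahead: score k candidate actions with the critic,
    then execute only the highest-scoring one, since intermediate states
    cannot be saved and restored."""
    state = env.reset()
    last_score = 0.0
    for _ in range(max_steps):
        candidates = propose_actions(state, k)
        best = max(candidates, key=lambda a: q_value(state, a))
        last_score = q_value(state, best)
        state, done = env.step(best)                    # the only action actually applied
        if done:
            break
    return state, last_score


def trajectory_selection(
    run_episode: Callable[[], Tuple[str, float]],       # e.g. one_step_lookahead bound to a fresh env
    num_trajectories: int = 4,
) -> str:
    """Generate several complete solutions and keep the one the critic ranks highest."""
    scored = [run_episode() for _ in range(num_trajectories)]
    best_state, _ = max(scored, key=lambda pair: pair[1])
    return best_state
```

In this sketch the two strategies compose naturally: each individual rollout can use 1-step lookahead, and trajectory selection then picks among several such rollouts using the same critic.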
Lay Summary: When AI agents tackle complex multi-turn tasks, their performance is often inconsistent — sometimes they produce brilliant solutions, other times they fail miserably. To improve reliability, we can use search techniques to explore many possible solution paths before committing to a specific one. However, search can be difficult in environments like Docker containers, where states cannot be easily captured or "rewound", and such environments are important for domains like software engineering. We explore search strategies specifically designed for these challenging "non-serializable" environments. Our solution implements two strategies: looking $1$ step ahead to evaluate potential actions, and generating multiple complete solutions to select the best one. Both are guided by the same trained model, which estimates which actions are likely to succeed. When applied to SWE-bench Verified, a benchmark based on real-world software engineering tasks, our approach doubles the success rate of our AI agent to $40.8$\% — the best among publicly available systems. Our techniques also improve more advanced models like GPT-4o. Our research makes AI software agents more reliable, bringing us closer to practical AI programming assistance.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: RL, Planning, Test-time search, Test-time computation, Agents, LLM, Software Engineering, SWE-bench, Process supervision, Outcome supervision
Submission Number: 12167