Keywords: Tool use, LLM agent, environment, test-time scaling
Abstract: The ability to use tools is fundamental to large language model (LLM) agents. However, when solving complex tasks, current LLMs are prone to incorrect tool selection and invalid tool-call arguments. Although letting LLMs iteratively refine the tool-call sequence using execution results from real tools can help, repeated testing on real tools can be expensive and lead to unintended side effects. To improve LLM tool calls while avoiding the issues caused by refining against real tools, we introduce Gecko, an environment that simulates tool execution results using a combination of rules and LLMs. Specifically, Gecko checks the validity of tool calls, including input arguments and tool names, synthesizes reasonable responses that adhere to the output schema, and assesses whether all task objectives have been achieved. The feedback provided by Gecko allows LLMs to refine their tool calls, forming a simple yet effective test-time scaling method named GATS. In addition, we design an automated API schema converter so that Gecko can quickly integrate and simulate a large number of tools. On BFCL and $\tau^2$-bench, GATS, enabled by Gecko, consistently improves the tool-calling performance of existing LLMs, including GPT-4o and GPT-5, and yields new state-of-the-art results. We further discuss the working mechanisms of our method and outline promising future directions.
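Below is a minimal sketch, not the authors' implementation, of the kind of refinement loop the abstract describes: a simulated environment validates each tool call against a registered schema, returns either a synthesized result or structured feedback, and the agent retries with that feedback until the call is accepted or a retry budget is exhausted. All names (`ToolSchema`, `SimulatedEnv`, `refine_tool_call`, the toy proposer) are hypothetical illustrations under assumed interfaces.

```python
# Hypothetical sketch of tool-call refinement against a simulated environment.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolSchema:
    name: str
    required_args: dict[str, type]         # argument name -> expected type
    respond: Callable[[dict], dict]        # rule for synthesizing a plausible result

@dataclass
class SimulatedEnv:
    tools: dict[str, ToolSchema] = field(default_factory=dict)

    def register(self, schema: ToolSchema) -> None:
        self.tools[schema.name] = schema

    def step(self, name: str, args: dict[str, Any]) -> dict:
        """Validate a tool call; return a synthesized output or structured feedback."""
        if name not in self.tools:
            return {"ok": False, "feedback": f"unknown tool '{name}'"}
        schema = self.tools[name]
        for arg, typ in schema.required_args.items():
            if arg not in args:
                return {"ok": False, "feedback": f"missing argument '{arg}'"}
            if not isinstance(args[arg], typ):
                return {"ok": False, "feedback": f"argument '{arg}' must be {typ.__name__}"}
        return {"ok": True, "result": schema.respond(args)}

def refine_tool_call(env: SimulatedEnv, propose: Callable[[str], tuple[str, dict]],
                     task: str, max_rounds: int = 3) -> dict:
    """Test-time refinement: fold the simulator's feedback back into the prompt and retry."""
    prompt = task
    outcome: dict = {"ok": False, "feedback": "no attempt made"}
    for _ in range(max_rounds):
        name, args = propose(prompt)       # stands in for an LLM proposing a tool call
        outcome = env.step(name, args)
        if outcome["ok"]:
            break
        prompt = f"{task}\nPrevious attempt failed: {outcome['feedback']}"
    return outcome

# Example usage with a toy weather tool and a proposer that fixes a missing argument.
env = SimulatedEnv()
env.register(ToolSchema("get_weather", {"city": str},
                        respond=lambda a: {"city": a["city"], "forecast": "sunny"}))

def toy_proposer(prompt: str) -> tuple[str, dict]:
    # First attempt omits the required argument; later attempts include it.
    return ("get_weather", {"city": "Paris"} if "failed" in prompt else {})

print(refine_tool_call(env, toy_proposer, "What is the weather in Paris?"))
```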
Primary Area: foundation or frontier models, including LLMs
Submission Number: 540