Keywords: Search agent, test-time scaling, tool-augmented agent, budget constraint
Abstract: Scaling test-time computation improves the performance of large language models (LLMs) across a wide range of tasks, and this approach has recently been extended to tool-augmented agents. For these agents, scaling involves not only "thinking" in tokens but also "acting" via tool calls. Unlike tokens in textual reasoning, the number of tool calls directly bounds the agent's interaction with the external environment. Motivated by this, we study how to scale such agents under explicit tool-call budgets, focusing on web search agents equipped with search and browse tools. We introduce CATS (Cost-effective Agent Test-time Scaling), a budget-aware framework for effective and efficient agent scaling. CATS integrates a lightweight budget tracker that provides a continuous signal of remaining resources to the agent's core modules, encouraging budget-aware adaptations in planning and verification. By constraining the number of tool calls and unifying the costs of token and tool consumption, we analyze cost–performance scaling behavior in a controlled manner. Experiments on search-intensive benchmarks show that CATS yields more favorable scaling curves, attaining higher accuracy with fewer tool calls and lower overall cost. Our work advances cost-conscious design for agent test-time scaling and offers empirical insights toward a more transparent and principled understanding of scaling in tool-augmented agents.
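To make the budget-tracker idea in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: all names (BudgetTracker, the cost weights, the signal format) are hypothetical assumptions, illustrating only one plausible way to expose remaining tool calls and a unified token+tool cost as a continuous signal to an agent's planning and verification modules.

```python
# Hypothetical sketch of a budget tracker for a tool-call-constrained agent.
# Names and cost weights are illustrative assumptions, not the paper's API.
from dataclasses import dataclass


@dataclass
class BudgetTracker:
    """Tracks remaining tool calls and a unified token+tool cost."""
    max_tool_calls: int              # explicit tool-call budget
    token_cost_per_1k: float = 1.0   # assumed cost of 1k generated tokens
    tool_call_cost: float = 5.0      # assumed cost of one search/browse call
    tool_calls_used: int = 0
    tokens_used: int = 0

    def record_tool_call(self) -> None:
        self.tool_calls_used += 1

    def record_tokens(self, n: int) -> None:
        self.tokens_used += n

    @property
    def remaining_calls(self) -> int:
        return max(self.max_tool_calls - self.tool_calls_used, 0)

    @property
    def unified_cost(self) -> float:
        # Unified cost combining token and tool consumption.
        return (self.tokens_used / 1000) * self.token_cost_per_1k \
            + self.tool_calls_used * self.tool_call_cost

    def signal(self) -> str:
        """Budget signal injected into the agent's context at each step."""
        return (f"[budget] {self.remaining_calls}/{self.max_tool_calls} "
                f"tool calls remaining; cost so far: {self.unified_cost:.1f}")
```

In such a setup, each search or browse call would pass through record_tool_call(), and signal() would be prepended to the agent's context at every step, so that planning and verification can adapt, for example by verifying more conservatively when few calls remain.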
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22208