Can LLM Agents Use Tools Economically? Benchmarking Cost-Optimal Planning and Adaptive Replanning in Dynamic Environments

Published: 28 Apr 2026 · Last Modified: 28 Apr 2026 · MSLD 2026 Poster · CC BY 4.0
Keywords: benchmarking, evaluation methodologies, evaluation
TL;DR: CostBench benchmarks cost-optimal tool planning and adaptation in dynamic environments, showing that even state-of-the-art LLM agents struggle with economic reasoning.
Abstract: Current evaluations of Large Language Model (LLM) agents primarily measure task completion while overlooking resource efficiency and adaptability. This leaves an important capability insufficiently studied: generating and adjusting cost-optimal plans in changing environments. To address this gap, we introduce **CostBench**, a scalable benchmark for evaluating cost-aware planning and replanning. CostBench is built in a travel-planning setting where tasks can be solved using multiple sequences of atomic and composite tools with configurable costs. It also introduces dynamic blocking events, such as tool failures and cost changes, requiring agents to adapt their plans. Evaluations of leading open-source and proprietary models reveal substantial limitations in cost-aware planning: agents often fail to find optimal solutions even in static settings (with GPT-5 achieving under 75% exact match on the hardest tasks), and performance drops further under dynamic conditions. CostBench provides a foundation for developing agents that are both economically rational and robust.
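To make the planning setup concrete, below is a minimal sketch of the kind of task CostBench poses: picking the cheapest sequence of tools to reach a goal, then replanning after a blocking event. This is not the authors' implementation; the tool names, costs, and the Dijkstra-style search are all invented here purely for illustration.

```python
import heapq

# Hypothetical toy instance in the spirit of CostBench: each tool is an
# edge between travel states with a configurable cost. All names and
# numbers below are made up for this sketch.
TOOLS = {
    "train_A_B":  {"src": "A", "dst": "B", "cost": 30},
    "flight_A_B": {"src": "A", "dst": "B", "cost": 80},
    "bus_B_C":    {"src": "B", "dst": "C", "cost": 15},
    "flight_A_C": {"src": "A", "dst": "C", "cost": 60},  # direct composite route
}

def cheapest_plan(tools, start, goal):
    """Dijkstra over states; returns (total_cost, list of tool names)."""
    frontier = [(0, start, [])]
    settled = {}
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if settled.get(state, float("inf")) <= cost:
            continue  # already reached this state more cheaply
        settled[state] = cost
        for name, tool in tools.items():
            if tool["src"] == state:
                heapq.heappush(frontier,
                               (cost + tool["cost"], tool["dst"], plan + [name]))
    return float("inf"), []

# Static setting: find the cost-optimal tool sequence from A to C.
print(cheapest_plan(TOOLS, "A", "C"))   # (45, ['train_A_B', 'bus_B_C'])

# Dynamic blocking event: the bus tool fails, so the agent must replan;
# the formerly pricier direct flight is now the optimal choice.
blocked = {k: v for k, v in TOOLS.items() if k != "bus_B_C"}
print(cheapest_plan(blocked, "A", "C"))  # (60, ['flight_A_C'])
```

The exact-match metric mentioned in the abstract would compare an agent's proposed tool sequence against the optimum returned by a search like this; the benchmark's difficulty comes from scaling the tool graph and injecting blocking events mid-episode.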
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 130