RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

ACL ARR 2026 January Submission 5110 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Long-horizon planning, Retail simulation benchmark, LLM agents
Abstract: Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks, yet their ability to maintain coherent decision-making over long horizons in dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution, enabling adaptive and interpretable strategy evolution over time. This design is crucial for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on seven state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to a Reflection-based baseline. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.
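The abstract's central design idea, revising high-level strategy on a slower timescale than low-level action execution, can be illustrated with a minimal sketch. All names here (`revise_strategy`, `execute_action`, `STRATEGY_PERIOD`, the restocking rule) are illustrative assumptions, not the paper's actual framework or API:

```python
import random

# Two-timescale agent loop (hypothetical sketch): strategy is revised only
# every STRATEGY_PERIOD steps, while actions execute at every step.
STRATEGY_PERIOD = 5

def revise_strategy(history):
    """Stand-in for high-level strategic reasoning (e.g., an LLM call)."""
    recent = history[-STRATEGY_PERIOD:]
    avg_demand = sum(recent) / max(len(recent), 1)
    return {"restock_threshold": avg_demand}

def execute_action(strategy, inventory, demand):
    """Stand-in for low-level action execution under the current strategy."""
    inventory -= demand
    if inventory < strategy["restock_threshold"]:
        inventory += 2 * strategy["restock_threshold"]  # place a restock order
    return inventory

def run_episode(steps=20, seed=0):
    rng = random.Random(seed)
    inventory, history = 50.0, []
    strategy = {"restock_threshold": 10.0}
    for t in range(steps):
        if t > 0 and t % STRATEGY_PERIOD == 0:
            strategy = revise_strategy(history)   # slow timescale: strategy evolution
        demand = rng.uniform(0, 10)               # stochastic demand
        inventory = execute_action(strategy, inventory, demand)  # fast timescale
        history.append(demand)
    return inventory
```

The key point the sketch captures is the separation of concerns: the strategy object changes only at the slower cadence, so execution errors at individual steps do not immediately perturb the high-level plan.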
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: AI/LLM Agents
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5110