EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Published: 02 Mar 2026, Last Modified: 05 Mar 2026, Venue: LLA 2026 Poster, License: CC BY 4.0
Keywords: LLM Planning, LLM Agents, Tool Calling, Benchmark
TL;DR: EnterpriseOps-Gym is a new benchmark for testing LLM agents in long-horizon, stateful, multi-tool workflows. We show that even strong models struggle with planning consistency, permissions, and error recovery.
Abstract: The rapid evolution of Large Language Models has shifted their role from passive information providers to active agents intended for complex workflows. However, the deployment of a reliable AI worker is stalled by benchmarks that fail to capture the intricacy of professional environments, specifically the need for long-horizon planning amidst persistent state changes and strict access protocols. To bridge this gap, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. To achieve this fidelity, EnterpriseOps-Gym features a containerized sandbox hosting 164 database tables and 512 functional tools, a scale essential to mimic the search friction and persistent state management of a real-world workplace. The benchmark includes 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT) where reliability is paramount. Our evaluation reveals critical limitations in state-of-the-art models: even the top-performing Claude Sonnet-4.5 achieves only 34.1% success, struggling significantly with planning consistency, error recovery, and policy constraints. Crucially, we find that providing oracle human plans improves performance by over 30%, pinpointing strategic reasoning as the primary bottleneck. Furthermore, we observe that agents frequently fail to refuse infeasible tasks, leading to unintended and potentially harmful side effects on the system. These findings indicate that current agents are not yet ready for enterprise deployment. By releasing EnterpriseOps-Gym, we provide a concrete testbed to advance the robustness of agentic planning in professional workflows.
Submission Number: 196