EnterpriseBench: Benchmarking LLM Agents on Enterprise-Level Strategic Reasoning and Decision-Making

ACL ARR 2026 January Submission 9641 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Model Agent, Complex Reasoning, Decision-making, Benchmark
Abstract: As Large Language Model (LLM) agents demonstrate increasingly strong reasoning capabilities, they are being progressively adopted to support complex decision-making in enterprise environments. However, existing benchmarks primarily evaluate reasoning through static, single-shot tasks with well-defined objectives and immediate correctness, whereas real-world decision-making is inherently interactive, open-ended, and shaped by delayed consequences and competing goals. To address this gap, we introduce EnterpriseBench, a comprehensive benchmark for assessing LLM agents in realistic enterprise contexts. EnterpriseBench covers a hierarchy of reasoning demands, ranging from information extraction and numerical and domain-knowledge reasoning to high-fidelity interactive decision-making, including management consulting cases and serious games. We also introduce an agent-oriented taxonomy that organizes tasks by capability domain and intrinsic difficulty. Empirical evaluations across nine state-of-the-art agent architectures reveal a natural division of labor: simpler tasks favor lightweight models, while complex decision-making benefits from stronger reasoning agents. Building on this insight, we present an Agent-of-Agents framework that integrates the complementary capabilities of these specialized agents. Our code and benchmark implementation are publicly available.
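The abstract describes the Agent-of-Agents framework only at a high level. As a minimal sketch of one plausible realization, assuming tasks arrive tagged with a capability domain and that a dispatch table routes each domain to a specialized agent (the names Task, lightweight_agent, reasoning_agent, ROUTES, and route are illustrative, not from the paper):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    domain: str  # e.g. "extraction", "numerical", "interactive"

# Hypothetical specialized agents; in practice these would wrap LLM calls.
def lightweight_agent(task: Task) -> str:
    return f"[fast model] handled: {task.prompt}"

def reasoning_agent(task: Task) -> str:
    return f"[strong reasoner] handled: {task.prompt}"

# Illustrative routing table mapping capability domains to agents,
# mirroring the reported division of labor.
ROUTES: dict[str, Callable[[Task], str]] = {
    "extraction": lightweight_agent,
    "numerical": lightweight_agent,
    "interactive": reasoning_agent,
}

def route(task: Task) -> str:
    # Fall back to the stronger reasoner for unseen task types.
    agent = ROUTES.get(task.domain, reasoning_agent)
    return agent(task)

print(route(Task("Extract the revenue figure.", "extraction")))
print(route(Task("Advise on a market-entry strategy.", "interactive")))

The design choice sketched here, routing by task domain rather than always invoking the strongest model, is one simple way to exploit the complementary strengths the paper reports; the actual integration mechanism may differ.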
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 9641