BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

ACL ARR 2024 December Submission2129 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · License: CC BY 4.0
Abstract: Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems place higher demands on the collaboration capabilities of language models. Many benchmarks have been proposed to evaluate the cooperation and competition abilities of LLMs. However, existing works have overlooked scenarios where cooperation and competition coexist. Additionally, real-world environments require agents to have precise spatial perception abilities, which many existing studies also neglect. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages across three difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. Experimental results indicate that API-based models perform excellently on simple tasks, whereas small open-source models struggle even with simple tasks. On difficult tasks that require both collaborative and competitive abilities, API-based models demonstrate some collaborative capability, but there remains enormous room for improvement. The code for BattleAgentBench is available at \url{https://anonymous.4open.science/r/BattleAgentBench-256D}
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: prompting, applications
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: language, Chinese
Submission Number: 2129
