BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
Abstract: Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, such as building single agents and multi-agent systems. Compared to single agents, multi-agent systems place higher demands on the collaboration capabilities of language models. Many benchmarks have been proposed to evaluate the cooperation and competition abilities of LLMs, but existing work has overlooked scenarios in which cooperation and competition coexist. In addition, real-world environments require agents to have precise spatial perception, which many existing studies also neglect. To address these two problems, we propose BattleAgentBench, a benchmark that defines seven sub-stages across three difficulty levels and provides a fine-grained evaluation of language models in terms of single-agent scenario navigation, paired-agent task execution, and multi-agent collaboration and competition. Experimental results indicate that API-based models perform excellently on simple tasks, whereas small open-source models struggle even with them. On difficult tasks that require both collaborative and competitive abilities, API-based models demonstrate some collaborative capability, but there remains substantial room for improvement.
The code for BattleAgentBench is available at \url{https://anonymous.4open.science/r/BattleAgentBench-256D}
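As a rough illustration only (the actual harness is in the repository linked above), the sketch below shows one way a staged evaluation over the three difficulty levels described in the abstract could be organized in Python. The stage names, the `Stage` dataclass, and the `run_stage`/`evaluate` helpers are hypothetical assumptions, not taken from the benchmark's code.

```python
# Hypothetical sketch of a staged multi-agent benchmark harness.
# Stage names and helpers are illustrative; they do not mirror the real repository.
from dataclasses import dataclass
from typing import Callable, Dict, List
import random

@dataclass
class Stage:
    name: str
    level: str        # "single-agent", "paired-agent", or "multi-agent"
    num_agents: int

# Seven sub-stages across three levels, following the abstract's description.
STAGES: List[Stage] = [
    Stage("navigation-easy", "single-agent", 1),
    Stage("navigation-hard", "single-agent", 1),
    Stage("paired-cooperation", "paired-agent", 2),
    Stage("paired-competition", "paired-agent", 2),
    Stage("multi-cooperation", "multi-agent", 4),
    Stage("multi-competition", "multi-agent", 4),
    Stage("multi-mixed", "multi-agent", 4),
]

def run_stage(stage: Stage, policy: Callable[[str], str], episodes: int = 5) -> float:
    """Run one sub-stage for a given agent policy and return a toy success rate."""
    successes = 0
    for _ in range(episodes):
        observation = f"{stage.name}: {stage.num_agents} agent(s) on the map"
        action = policy(observation)
        # Placeholder outcome; a real harness would step a game environment here.
        if action and random.random() > 0.5:
            successes += 1
    return successes / episodes

def evaluate(policy: Callable[[str], str]) -> Dict[str, float]:
    """Aggregate per-stage scores into a single report keyed by stage name."""
    return {stage.name: run_stage(stage, policy) for stage in STAGES}

if __name__ == "__main__":
    # A trivial stand-in for an LLM-backed agent policy.
    dummy_policy = lambda obs: "move_forward"
    for name, score in evaluate(dummy_policy).items():
        print(f"{name}: {score:.2f}")
```

In such a design, each sub-stage only varies the number of agents and the objective, so the same agent policy interface can be reused across the single-agent, paired-agent, and multi-agent levels.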
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: prompting, applications
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English, Chinese
Submission Number: 2129