BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
Abstract: Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, such as building single agents and multi-agent systems. Compared to single agents, multi-agent systems place higher demands on the collaboration capabilities of language models. Many benchmarks have been proposed to evaluate the cooperation and competition abilities of LLMs, but existing work has overlooked scenarios in which cooperation and competition coexist. In addition, real-world environments require agents to have precise spatial perception, which many existing studies also neglect. To address these two problems, we propose BattleAgentBench, a benchmark that defines seven sub-stages across three difficulty levels and provides a fine-grained evaluation of language models in terms of single-agent scenario navigation, paired-agent task execution, and multi-agent collaboration and competition. Experimental results indicate that API-based models perform excellently on simple tasks, whereas small open-source models struggle even with them. On difficult tasks that require both collaborative and competitive abilities, API-based models demonstrate some collaborative capability, but there remains substantial room for improvement.
The code for BattleAgentBench is available at \url{https://anonymous.4open.science/r/BattleAgentBench-256D}
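As a rough illustration only (the actual harness is in the repository linked above), the sketch below shows one way a staged evaluation over the three difficulty levels described in the abstract could be organized in Python. The stage names, the `Stage` dataclass, and the `run_stage`/`evaluate` helpers are hypothetical assumptions, not taken from the benchmark's code.

```python
# Hypothetical sketch of a staged multi-agent benchmark harness.
# Stage names and helpers are illustrative; they do not mirror the real repository.
from dataclasses import dataclass
from typing import Callable, Dict, List
import random

@dataclass
class Stage:
    name: str
    level: str        # "single-agent", "paired-agent", or "multi-agent"
    num_agents: int

# Seven sub-stages across three levels, following the abstract's description.
STAGES: List[Stage] = [
    Stage("navigation-easy", "single-agent", 1),
    Stage("navigation-hard", "single-agent", 1),
    Stage("paired-cooperation", "paired-agent", 2),
    Stage("paired-competition", "paired-agent", 2),
    Stage("multi-cooperation", "multi-agent", 4),
    Stage("multi-competition", "multi-agent", 4),
    Stage("multi-mixed", "multi-agent", 4),
]

def run_stage(stage: Stage, policy: Callable[[str], str], episodes: int = 5) -> float:
    """Run one sub-stage for a given agent policy and return a toy success rate."""
    successes = 0
    for _ in range(episodes):
        observation = f"{stage.name}: {stage.num_agents} agent(s) on the map"
        action = policy(observation)
        # Placeholder outcome; a real harness would step a game environment here.
        if action and random.random() > 0.5:
            successes += 1
    return successes / episodes

def evaluate(policy: Callable[[str], str]) -> Dict[str, float]:
    """Aggregate per-stage scores into a single report keyed by stage name."""
    return {stage.name: run_stage(stage, policy) for stage in STAGES}

if __name__ == "__main__":
    # A trivial stand-in for an LLM-backed agent policy.
    dummy_policy = lambda obs: "move_forward"
    for name, score in evaluate(dummy_policy).items():
        print(f"{name}: {score:.2f}")
```

In such a design, each sub-stage only varies the number of agents and the objective, so the same agent policy interface can be reused across the single-agent, paired-agent, and multi-agent levels.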
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: prompting, applications
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English, Chinese
Submission Number: 2129