MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

ACL ARR 2024 June Submission 5326 Authors

16 Jun 2024 (modified: 01 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have significantly advanced natural language processing, demonstrating exceptional reasoning, tool-usage, and memory capabilities. As their applications expand into multi-agent environments, there is a need for a comprehensive evaluation framework that captures LLMs' reasoning, planning, collaboration, and other social abilities. This work introduces a novel competition-based benchmark framework specifically designed to assess LLMs within multi-agent settings, providing quantitative metrics to evaluate their judgment, reasoning, deception, self-awareness, cooperation, coordination, and rationality. To create diverse environments, we utilize two social deduction games alongside three game-theory scenarios. Our framework is fortified with the probabilistic graphical modeling (PGM) method, enhancing the LLMs' capabilities in navigating complex social and cognitive dimensions. We evaluate seven LLMs, quantitatively highlighting a significant capability gap of over threefold between the strongest, GPT-4, and the weakest, Llama-2-70B. Our results also confirm that the PGM enhancement boosts the abilities of all selected models by an average of 37\%. Our code is available at the anonymous link https://anonymous.4open.science/r/magic_anonym-5366.
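As a rough illustration of the kind of per-ability scoring a competition-based benchmark like this implies (a minimal sketch only; the data, function name, and ability labels below are hypothetical and not taken from the paper's released code), one could average each model's per-game outcomes within every ability dimension:

```python
from collections import defaultdict

# Hypothetical game records: (model, ability, normalized score in [0, 1]).
# Ability labels mirror those named in the abstract; the numbers are illustrative.
records = [
    ("GPT-4", "judgment", 0.82),
    ("GPT-4", "deception", 0.74),
    ("Llama-2-70B", "judgment", 0.31),
    ("Llama-2-70B", "deception", 0.20),
]

def aggregate_scores(records):
    """Average each model's per-game scores within every ability dimension."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model, ability, score in records:
        sums[(model, ability)] += score
        counts[(model, ability)] += 1
    return {key: sums[key] / counts[key] for key in sums}

if __name__ == "__main__":
    for (model, ability), score in sorted(aggregate_scores(records).items()):
        print(f"{model:12s} {ability:10s} {score:.2f}")
```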
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: LLM evaluation, multi-agent systems, benchmarking, metrics
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5326