MARSBench: Evaluating Multi-Agent Multi-Turn Strategic Reasoning of Large Language Models and Beyond

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission
Abstract: Current logical reasoning benchmarks for Large Language Models (LLMs) primarily focus on single-turn, static settings such as arithmetic problems; the crucial problem of multi-turn strategic reasoning remains under-explored. We introduce MARSBench, a novel framework to evaluate the multi-turn strategic reasoning of LLMs through text-driven complete- and incomplete-information games, e.g., board games (Tic-Tac-Toe, Connect-4) and poker (Texas Hold'em). MARSBench offers two distinct scenarios: 1) Online Racing, in which multiple LLMs/agents compete directly for head-to-head comparison; 2) Offline Probing, which poses targeted questions with verified ground truth to assess LLMs' strategic behaviors. We show that existing state-of-the-art LLMs and reasoning schemes are largely ineffective for strategic reasoning tasks. For instance, GPT-3.5-turbo equipped with the advanced Tree-of-Thought (ToT) scheme performs only slightly better than a random agent even at simple Tic-Tac-Toe. Offline probing indicates that these LLMs suffer from serious hallucinations (e.g., in spatial understanding) and weak strategic thinking (e.g., in endgames). We propose a recursively thinking-ahead agent to strengthen the strategic reasoning of LLMs. We hope MARSBench will spur further research and exploration in the multi-turn strategic reasoning of LLMs.
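To make the "recursively thinking-ahead" idea concrete, below is a minimal, hypothetical Python sketch of such an agent for Tic-Tac-Toe. The abstract does not specify the agent's actual design, so this sketch substitutes a classical minimax recursion for the paper's LLM-driven look-ahead; the names `think_ahead` and `winner` and the scoring convention are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a recursive look-ahead (thinking-ahead) agent for
# Tic-Tac-Toe. The paper's agent presumably prompts an LLM to simulate future
# turns; classical minimax stands in for that recursion here.

from typing import List, Optional, Tuple

Board = List[str]  # 9 cells, each "X", "O", or " "

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board: Board) -> Optional[str]:
    """Return "X" or "O" if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def think_ahead(board: Board, player: str) -> Tuple[int, Optional[int]]:
    """Recursively evaluate the position for the player to move.

    Returns (score, best_move) where score is +1 for a forced win,
    0 for a draw, and -1 for a forced loss under best play.
    """
    won = winner(board)
    if won is not None:
        return (1 if won == player else -1), None
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0, None  # board full: draw
    opponent = "O" if player == "X" else "X"
    best_score, best_move = -2, None
    for m in moves:
        board[m] = player
        score, _ = think_ahead(board, opponent)  # recurse from opponent's view
        board[m] = " "
        if -score > best_score:  # opponent's loss is our gain
            best_score, best_move = -score, m
    return best_score, best_move

if __name__ == "__main__":
    board = ["X", "O", "X",
             " ", "O", " ",
             " ", " ", " "]
    _, move = think_ahead(board, "X")
    print(f"X should play cell {move}")  # expected: 7, blocking O's column 1-4-7
```

In the paper's setting, the recursive call would be replaced by querying an LLM about hypothetical future board states; the sketch only illustrates the control flow such an agent might follow.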
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English