Large Language Models as Gaming Agents

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Language model, agent, reasoning, decision making
Abstract: Although recent work has demonstrated that Large Language Models (LLMs) are starting to excel at following human instructions, their strategic thinking, planning, and long-term decision-making skills remain unclear. To rigorously evaluate these capabilities, we propose leveraging strategic gaming environments, as they provide well-defined, structured benchmarks with clear success criteria. Specifically, we adopt several popular reasoning-oriented autonomous agents and analyze their performance in two popular strategic gaming environments: Tic-Tac-Toe (one of the most popular complete-information games) and Texas Hold’em Poker (one of the most popular incomplete-information games). To our surprise, we find that even one of the most advanced LLMs, ChatGPT, is largely ineffective in these two gaming scenarios. Even more surprisingly, state-of-the-art reasoning methods, e.g., Chain-of-Thought and ReAct, do not help much. For instance, in the naive 3×3 Tic-Tac-Toe environment, nearly all agents perform only slightly better than the random agent, i.e., an agent that selects an action uniformly at random at each step. To understand this failure mode in more depth, we carry out a detailed analysis, which uncovers two potential reasons behind this weakness: 1) autonomous agents lack gaming intents, i.e., they cannot “think ahead” to defend against opponents’ future moves; 2) LLMs suffer from severe hallucinations and factual errors, e.g., even advanced reasoning agents fail to recognize immediate win/lose situations. With these insights, we take a first step by proposing a simple yet effective Think Ahead Language-powered Gaming Agent (TALAGA). TALAGA recursively thinks ahead of the opponent’s moves, evaluates the current game state, and adjusts action selection by backtracking reward signals. We further equip TALAGA with additional features, such as uncertainty estimation, to alleviate hallucinations and factual errors.
Experimental results demonstrate that TALAGA significantly outperforms existing autonomous reasoning agents. A broader implication of our exploration is that games can serve as stress tests for LLMs, pushing them to their limits and uncovering vulnerabilities and weaknesses. We hope that this paper sheds new light on the limitations of current autonomous reasoning agents, which, in turn, can guide model improvements toward greater robustness.
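To make the “think ahead, evaluate, backtrack” loop concrete, the sketch below shows classic minimax lookahead on 3×3 Tic-Tac-Toe: the searcher tries each move, recursively simulates the opponent’s best reply, and backtracks the resulting reward to pick an action. This is only an illustration of the kind of recursive lookahead the abstract describes, not the authors’ TALAGA implementation; all names here are our own.

```python
# Illustrative minimax lookahead on 3x3 Tic-Tac-Toe (not the TALAGA code).
# Board: list of 9 cells, each "X", "O", or None, indexed 0..8 row-major.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, best_move) for `player`: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:                       # terminal: someone already won
        return (1 if w == player else -1), None
    moves = [i for i, cell in enumerate(board) if cell is None]
    if not moves:                           # terminal: board full, draw
        return 0, None
    opponent = "O" if player == "X" else "X"
    best_score, best_move = -2, None
    for m in moves:
        board[m] = player                   # think ahead: try the move
        score, _ = minimax(board, opponent) # opponent replies optimally
        board[m] = None                     # backtrack the simulated move
        score = -score                      # opponent's gain is our loss
        if score > best_score:
            best_score, best_move = score, m
    return best_score, best_move
```

For example, with the board `["X", "X", None, "O", "O", None, None, None, None]` and X to move, the search backtracks a reward of +1 for cell 2, which completes X's top row; a purely reactive agent that cannot look ahead has no such signal and may instead miss both the win and O's own threat on cell 5.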
Supplementary Material: pdf
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5875