Mastering Board Games by External and Internal Planning with Language Models

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 spotlight poster · CC BY 4.0
TL;DR: We pre-trained an LLM capable of playing board games at a high level. We further introduce external and internal planning methods that achieve Grandmaster-level performance in chess while operating closer to the human search budget.
Abstract: Advancing the planning and reasoning capabilities of Large Language Models (LLMs) is one of the key prerequisites for unlocking their potential to perform reliably in complex and impactful domains. In this paper, we aim to demonstrate this across board games (Chess, Fischer Random / Chess960, Connect Four, and Hex), and we show that search-based planning can yield significant improvements in LLM game-playing strength. We introduce, compare, and contrast two major approaches: in *external search*, the model guides Monte Carlo Tree Search (MCTS) rollouts and evaluations without calls to an external game engine, and in *internal search*, the model is trained to generate in context a linearized search tree and a resulting final choice. Both build on a language model pre-trained on relevant domain knowledge, reliably capturing the transition and value functions of the respective environments with minimal hallucinations. We evaluate our LLM search implementations against game-specific state-of-the-art engines, showcasing substantial improvements in strength over the base model and reaching Grandmaster-level performance in chess while operating closer to the human search budget. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.
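The external-search idea described above can be sketched as a PUCT-style MCTS in which a model, rather than a game engine, supplies move priors and leaf evaluations. This is a minimal illustrative sketch, not the paper's implementation: `model_policy` and `model_value` are hypothetical stand-ins for LLM calls, and the toy "sum toward a target" game replaces the actual board games.

```python
import math

TARGET = 5  # toy game: pick digits so the running sum hits TARGET exactly

def legal_moves(state):
    return [1, 2, 3] if sum(state) < TARGET else []

def model_policy(state):
    # Stand-in for an LLM prior over moves (uniform here for simplicity).
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves} if moves else {}

def model_value(state):
    # Stand-in for an LLM value estimate, replacing an engine rollout:
    # 1.0 if the target is hit exactly, 0.0 if overshot, 0.5 otherwise.
    s = sum(state)
    return 1.0 if s == TARGET else (0.0 if s > TARGET else 0.5)

class Node:
    def __init__(self, state, prior):
        self.state, self.prior = state, prior
        self.children, self.visits, self.value_sum = {}, 0, 0.0

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts(root_state, simulations=200, c_puct=1.5):
    root = Node(root_state, 1.0)
    for _ in range(simulations):
        node, path = root, [root]
        # Selection: descend with the PUCT rule until an unexpanded node.
        while node.children:
            total = sum(ch.visits for ch in node.children.values())
            node = max(
                node.children.values(),
                key=lambda ch: ch.q()
                + c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visits),
            )
            path.append(node)
        # Expansion: model priors define the children of the leaf.
        for move, p in model_policy(node.state).items():
            node.children[move] = Node(node.state + [move], p)
        # Evaluation and backup: the model's value estimate is propagated
        # along the selected path.
        v = model_value(node.state)
        for n in path:
            n.visits += 1
            n.value_sum += v
    # Final choice: the most-visited move at the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```

In the paper's setting, the two stand-in functions would be served by the pre-trained LLM's policy and value queries, so the search loop itself never consults a game engine.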
Lay Summary:
1. Large Language Models (LLMs) demonstrate impressive performance across various tasks that require complex reasoning. Yet they still struggle to play board games as simple as tic-tac-toe.
2. We developed an LLM that can play different board games, reaching Grandmaster-level chess performance. We investigated different planning strategies that enable the LLM to improve its performance as it is given more "thinking time."
3. In the future, similar planning strategies could unlock strong performance improvements in LLMs applied to other reasoning problems.
Primary Area: Deep Learning->Large Language Models
Keywords: search, planning, language models, games, chess
Submission Number: 12369