PokéChamp: an Expert-level Minimax Language Agent

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 spotlight poster. CC BY 4.0
TL;DR: PokéChamp, an LLM-powered AI agent for Pokémon battles, outperforms existing bots and achieves expert-level performance against human players using minimax search with LLM-based action sampling, opponent modeling, and value calculation.
Abstract: We introduce PokéChamp, a minimax agent powered by Large Language Models (LLMs) for Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp leverages the generalist capabilities of LLMs to enhance minimax tree search. Specifically, LLMs replace three key modules: (1) player action sampling, (2) opponent modeling, and (3) value function estimation, enabling the agent to effectively utilize gameplay history and human knowledge to reduce the search space and address partial observability. Notably, our framework requires no additional LLM training. We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot, demonstrating its superior performance. Even with an open-source 8-billion-parameter Llama 3.1 model, PokéChamp consistently outperforms the previous best LLM-based bot, PokéLLMon powered by GPT-4o, with a 64% win rate. PokéChamp attains a projected Elo of 1300-1500 on the Pokémon Showdown online ladder, placing it among the top 30%-10% of human players. In addition, this work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. This work establishes Pokémon as a benchmark for integrating LLM technologies with game-theoretic algorithms to address general multi-agent problems. Videos, code, and dataset are available online.
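The three LLM-backed modules named in the abstract slot naturally into a standard minimax recursion. The sketch below is a minimal illustration of that structure, not the paper's implementation: the `llm_*` helpers, the toy state dictionary, and the `apply` transition are all hypothetical stand-ins (here stubbed with deterministic heuristics so the code runs), where a real agent would query an LLM and the game engine.

```python
# Hypothetical sketch of LLM-guided minimax: the LLM prunes both players'
# action sets and scores leaf positions, shrinking the search tree.

def llm_player_actions(state):
    # (1) player action sampling: an LLM would propose a few promising moves
    return state["legal_moves"][:2]

def llm_opponent_actions(state):
    # (2) opponent modeling: an LLM would predict likely opponent responses
    return state["opponent_moves"][:2]

def llm_value(state):
    # (3) value function estimation: an LLM would judge the position
    return state["score"]

def apply(state, my_move, opp_move):
    # toy transition function; a real agent would use the game engine
    return {
        "legal_moves": state["legal_moves"],
        "opponent_moves": state["opponent_moves"],
        "score": state["score"] + my_move - opp_move,
    }

def minimax(state, depth):
    """Return (value, best move) assuming a worst-case opponent."""
    if depth == 0:
        return llm_value(state), None
    best_val, best_move = float("-inf"), None
    for a in llm_player_actions(state):        # search only sampled moves
        worst = float("inf")
        for b in llm_opponent_actions(state):  # search only modeled replies
            val, _ = minimax(apply(state, a, b), depth - 1)
            worst = min(worst, val)
        if worst > best_val:
            best_val, best_move = worst, a
    return best_val, best_move

state = {"legal_moves": [3, 1, 2], "opponent_moves": [1, 2], "score": 0}
value, move = minimax(state, depth=2)
```

Because the LLM restricts each node to a handful of sampled actions and replies, the tree stays tractable even at the depths needed for multi-turn planning.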
Lay Summary: PokéChamp is an artificial-intelligence player that battles in the popular strategy game Pokémon. Instead of being hand-coded with rigid rules, PokéChamp taps into the broad knowledge already stored inside large language models—the same technology that powers advanced chatbots. We give the language model three jobs: suggest promising moves, guess what the opponent might do next, and judge which future positions look best. With those ingredients it “thinks ahead” much like a skilled human, but without any extra training specific to Pokémon.

We tested PokéChamp in the game’s most competitive online format (Generation 9 OverUsed). Using the state-of-the-art GPT-4o model, it won about 80% of matches against the strongest existing bots and reached an Elo rating in the top 10–30% of human players on the public ladder. Even when we swapped in a much smaller, openly available model, PokéChamp still beat the previous best language-model bot most of the time.

To support research beyond our own agent, we collected and cleaned the largest public Pokémon battle dataset so far—over three million matches, including half a million high-level games. From this trove we built new skill puzzles and benchmarks and also patched the open-source game engine so future systems can be tested more reliably. By showing that language models can guide strategic search in a fast, partially hidden, multi-step game, PokéChamp offers a template for building versatile AI teammates and opponents in many other competitive settings.
Link To Code: https://github.com/sethkarten/pokechamp
Primary Area: Reinforcement Learning->Multi-agent
Keywords: multi-agent systems, LLM Agents, competitive games, partially observable, test-time compute, pokemon
Flagged For Ethics Review: true
Submission Number: 12489