Strategic Reasoning in Large Language Models

Published: 15 Mar 2026, Last Modified: 15 Mar 2026 · Oral · CC BY 4.0
Keywords: Strategic reasoning, large language models, game theory, benchmarking, incomplete-information games, poker decision making, context effects, reasoning evaluation, multi-agent systems, game-theoretic analysis
TL;DR: An empirical benchmark evaluating frontier LLMs on game-theoretic reasoning in complete and incomplete information games, with and without additional context.
Abstract: Large Language Models (LLMs) are increasingly deployed in settings where success depends on strategic reasoning under incentives, hidden information, and interaction with other agents. We present an empirical benchmark of strategic reasoning across nine frontier models: GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 3.1 Pro, Gemini 3.1 Flash Lite, DeepSeek V3.2, Grok 4.2, Kimi K2.5, and GLM-5. The evaluation covers four canonical games spanning complete and incomplete information (Prisoner’s Dilemma, Ultimatum Game, Kuhn Poker, and Texas Hold’em) under two conditions: with and without additional context. We score both decision quality and reasoning quality and aggregate them into a Strategic Reasoning Index (SRI). The results show strong heterogeneity across models: several systems are reliable on complete-information games, whereas Texas Hold’em arithmetic and multi-street decision problems remain the main stress test, and additional context can improve execution for some models while compressing or distorting reasoning in others. These findings support a more diagnostic view of strategic competence and highlight the importance of output reliability in action-taking pipelines.
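For concreteness, a minimal sketch of how such an index might be computed, assuming per-scenario scores in [0, 1] and an equal weighting of the two components; neither the score range nor the weighting is specified in the abstract, so both are assumptions:

```python
# Hypothetical sketch of SRI aggregation. The paper does not specify the
# weighting scheme or score scale; the 0.5/0.5 split and the [0, 1] range
# used here are assumptions for illustration only.
from statistics import mean

def strategic_reasoning_index(
    decision_scores: list[float],   # per-scenario decision-quality scores in [0, 1]
    reasoning_scores: list[float],  # per-scenario reasoning-quality scores in [0, 1]
    w_decision: float = 0.5,        # assumed weight on decision quality
) -> float:
    """Combine mean decision quality and mean reasoning quality into one index."""
    assert len(decision_scores) == len(reasoning_scores), "paired per-scenario scores"
    d = mean(decision_scores)
    r = mean(reasoning_scores)
    return w_decision * d + (1.0 - w_decision) * r

# Example: a model that decides well but explains poorly.
print(strategic_reasoning_index([0.9, 0.8, 1.0], [0.4, 0.5, 0.3]))  # -> 0.65
```

Under this reading, the index separates a model that acts well but reasons poorly from one that does both well, which is the diagnostic distinction the abstract emphasizes.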
Submission Number: 30