Keywords: Agents, Large language models, Game theory
Abstract: In this paper, we introduce a novel evaluation framework for assessing Large Language Model (LLM) capabilities through the Game Master paradigm -- where the LLM generates and orchestrates complex multi-agent games for AI players with distinct personalities. The framework comprises (a) a comprehensive game generation and evaluation system spanning 18 game types across 6 categories (strategy, negotiation, cooperative, competition, auction/resource, and narrative), and (b) a personality-based player model built on the Big Five (OCEAN) framework, with critical evaluator archetypes designed to prevent lenient assessment bias. To our knowledge, this is the first attempt to systematically evaluate LLMs' emergent capabilities -- creativity, logical reasoning, fairness, and narrative coherence -- through fully automated game-based assessment. Experiments with GPT-4.1 show an overall approval rate of only 13.0% across 162 games: cooperative games achieve 44.1% approval while strategy games fall to 2.2%, a roughly 20-fold performance gap. These results indicate that GPT-4.1 excels at cooperative narratives but struggles with balanced competitive game design.
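To make the personality-based player model concrete, the sketch below shows one plausible way an OCEAN profile and a critical-evaluator archetype could drive a game-approval decision. This is not the authors' released code; the class names, the threshold rule, and all numeric constants are hypothetical illustrations of the mechanism described in the abstract.

```python
# Minimal sketch of a Big Five (OCEAN) player profile with a
# "critical evaluator" archetype; all names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class OceanProfile:
    """Big Five trait scores, each in [0, 1]."""
    openness: float
    conscientiousness: float
    extraversion: float
    agreeableness: float
    neuroticism: float

@dataclass
class PlayerAgent:
    name: str
    profile: OceanProfile
    critical_evaluator: bool = False  # archetype intended to curb lenient bias

    def approves(self, game_quality: float) -> bool:
        """Approve a generated game if its quality clears this agent's bar.

        Hypothetical rule: less agreeable players demand higher quality,
        and critical evaluators raise the bar further to offset leniency.
        """
        threshold = 0.5 + 0.3 * (1.0 - self.profile.agreeableness)
        if self.critical_evaluator:
            threshold += 0.15
        return game_quality >= threshold

# Usage: a lenient player and a critical evaluator judging the same game.
lenient = PlayerAgent("ally", OceanProfile(0.6, 0.5, 0.7, 0.9, 0.2))
critic = PlayerAgent("judge", OceanProfile(0.4, 0.8, 0.3, 0.2, 0.5),
                     critical_evaluator=True)
print(lenient.approves(0.6))  # True  (threshold = 0.53)
print(critic.approves(0.6))   # False (threshold = 0.89)
```

Under a scheme like this, aggregate approval rates such as the reported 13.0% would come from tallying per-agent decisions across all generated games.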
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation, metrics, NLP datasets, LLM agents
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 6968