Keywords: Agents, Large language models, Game theory
Abstract: In this paper, we introduce a novel evaluation framework for assessing Large Language Model (LLM) capabilities through the Game Master paradigm -- where the LLM generates and orchestrates complex multi-agent games for AI players with distinct personalities. The framework comprises (a) a comprehensive game generation and evaluation system spanning 18 game types across 6 categories (strategy, negotiation, cooperative, competition, auction/resource, and narrative), and (b) a personality-based player model built on the Big Five (OCEAN) framework, with critical evaluator archetypes designed to prevent lenient assessment bias. To our knowledge, this is the first attempt to systematically evaluate LLMs' emergent capabilities -- creativity, logical reasoning, fairness, and narrative coherence -- through fully automated game-based assessment. Experiments with GPT-4.1 show an overall approval rate of only 13.0% across 162 games: cooperative games achieve 44.1% approval while strategy games fall to 2.2%, a roughly 20-fold performance gap. These results indicate that GPT-4.1 excels at cooperative narratives but struggles with balanced competitive game design.
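To make the personality-based player model concrete, the sketch below shows one plausible way an OCEAN profile and a critical-evaluator archetype could drive a game-approval decision. This is not the authors' released code; the class names, the threshold rule, and all numeric constants are hypothetical illustrations of the mechanism described in the abstract.

```python
# Minimal sketch of a Big Five (OCEAN) player profile with a
# "critical evaluator" archetype; all names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class OceanProfile:
    """Big Five trait scores, each in [0, 1]."""
    openness: float
    conscientiousness: float
    extraversion: float
    agreeableness: float
    neuroticism: float

@dataclass
class PlayerAgent:
    name: str
    profile: OceanProfile
    critical_evaluator: bool = False  # archetype intended to curb lenient bias

    def approves(self, game_quality: float) -> bool:
        """Approve a generated game if its quality clears this agent's bar.

        Hypothetical rule: less agreeable players demand higher quality,
        and critical evaluators raise the bar further to offset leniency.
        """
        threshold = 0.5 + 0.3 * (1.0 - self.profile.agreeableness)
        if self.critical_evaluator:
            threshold += 0.15
        return game_quality >= threshold

# Usage: a lenient player and a critical evaluator judging the same game.
lenient = PlayerAgent("ally", OceanProfile(0.6, 0.5, 0.7, 0.9, 0.2))
critic = PlayerAgent("judge", OceanProfile(0.4, 0.8, 0.3, 0.2, 0.5),
                     critical_evaluator=True)
print(lenient.approves(0.6))  # True  (threshold = 0.53)
print(critic.approves(0.6))   # False (threshold = 0.89)
```

Under a scheme like this, aggregate approval rates such as the reported 13.0% would come from tallying per-agent decisions across all generated games.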
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation, metrics, NLP datasets, LLM agents
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 6968