Keywords: game evaluations; meta-reasoning
TL;DR: Reasoning is not just about solving new problems, but also about deciding which problems are worth solving in the first place.
Abstract: Reasoning is not just about solving problems---it is also about evaluating which problems are worth solving at all. To date, evaluation of artificial intelligence (AI) systems has focused primarily on how they solve problems, often by studying how models play games. In this paper, we advocate for a new paradigm that assesses how AI systems themselves evaluate games. We leverage a large-scale dataset of over 100 novel board games and hundreds of human judgments to compare evaluations produced by language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff or fairness of games and assessing the funness of games. These queries span two dimensions relevant to designing evaluations of AI evaluations: how complex a query is to compute and how difficult it is to quantify. We find that reasoning models are generally more aligned with people in their evaluations of games. However, we observe a non-monotonic relationship: as models approach game-theoretically optimal play, their fit to human data weakens. We also observe more ``jaggedness'' across models when assessing funness, in line with the greater difficulty of quantifying this query.
Submission Number: 84