Keywords: game evaluations; meta-reasoning
TL;DR: Reasoning is not just about solving new problems, but also about deciding which problems are worth solving in the first place.
Abstract: Reasoning is not just about solving problems---it is also about evaluating which problems are worth solving at all. To date, evaluation of artificial intelligence (AI) systems has focused primarily on how they solve problems, often by studying how models play games. In this paper, we advocate for a new paradigm that assesses how AI systems themselves evaluate games. We leverage a large-scale dataset of over 100 novel board games and hundreds of human judgments to compare evaluations produced by language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff or fairness of games and assessing the funness of games. These queries span two dimensions relevant to designing evaluations of AI evaluations: how complex a query is to compute and how difficult it is to quantify. We find that reasoning models are generally more aligned with people in their evaluations of games. However, we observe a non-monotonic relationship: as models approach game-theoretically optimal play, their fit to human data weakens. We also observe more ``jaggedness'' across models when assessing funness, in line with the greater difficulty of quantifying this query.
Submission Number: 84