A Statistical Framework for Game-Based AI Evaluation

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: llm, evaluation, games
Abstract: We introduce a statistical framework for evaluating large language models (LLMs) in two-player games. The model separates premature endings, such as timeouts or repeated invalid moves, from the conditional outcome of win, draw, or loss. Both parts share a low-dimensional skill space for models and games, which lets us capture reliability (avoiding failures) and proficiency (winning valid games). Using the TextArena dataset (57 models, 30 games, about 38k matches, including human players), we learn skills that can be used to compare LLMs' skill profiles, rank models, or predict performance on other tasks such as solving mathematical problems. In sum, our method turns arena outcomes into a structured and interpretable map of model reliability and capability.
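As an illustrative sketch only (the notation below is assumed and not taken from the abstract), one way such a two-part model could be written uses a latent skill vector $u_m$ for model $m$ and game-specific vectors $v_g, w_g$ with offset $b_g$:

% Illustrative two-part formulation; u_m, v_g, w_g, b_g are assumed notation, not the paper's.
\begin{align*}
  \Pr(\text{premature end} \mid m, g) &= \sigma\bigl(u_m^{\top} v_g + b_g\bigr), \\
  \Pr(m \text{ beats } m' \mid \text{valid game}, g) &= \sigma\bigl((u_m - u_{m'})^{\top} w_g\bigr),
\end{align*}

where $\sigma$ is the logistic function and draws could be handled by an ordinal extension of the second part (e.g. a Rao–Kupper-style draw parameter). Sharing $u_m$ across both parts is what ties reliability (the failure model) and proficiency (the conditional outcome model) to a common low-dimensional skill space.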
Submission Number: 147