Keywords: language models, benchmarks, reasoning, environments, games
TL;DR: We present a pipeline for generating games that can be used to evaluate LLMs
Abstract: We present gg-bench, a collection of generated game environments designed to evaluate the reasoning capabilities of language models. gg-bench is synthetically generated by (1) using an LLM to write game descriptions in natural language, (2) using the same LLM to implement each game in code, and (3) training RL agents via self-play on the generated games. We evaluate models based on their win rate against these RL agents by prompting them with the game description, current board state, and a list of valid moves, after which models output the moves they wish to take. gg-bench is challenging: general-purpose LLMs (GPT-4o, Claude 3.7 Sonnet) achieve win rates of 7–9% on gg-bench using in-context learning, while reasoning models (o1, o3-mini, DeepSeek-R1) achieve average win rates of 31–36%. Additionally, because gg-bench is a data generating process rather than a static benchmark, new evaluation instances can be created at will. We release the generated games, data generation process, and evaluation code to support future modeling work and expansion of our benchmark.
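The abstract describes the evaluation protocol at a high level: each turn, the model is prompted with the game description, the current board state, and the list of valid moves, and its chosen move is played against a self-play RL agent. The sketch below is an illustrative rendering of that loop, not the released evaluation code; the `TwoPlayerGame` interface, `GameState` fields, and `query_model` callable are all assumptions for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical minimal interface for a generated game environment.
# Field and class names here are illustrative, not gg-bench's actual API.
@dataclass
class GameState:
    board: str              # textual rendering of the current board
    valid_moves: List[int]  # moves the model may legally choose
    done: bool = False
    llm_won: bool = False

class TwoPlayerGame:
    """Toy stand-in for a generated game; a real environment would also
    apply the RL opponent's response inside step()."""
    def __init__(self, description: str):
        self.description = description
        self.turns_left = 10

    def reset(self) -> GameState:
        self.turns_left = 10
        return GameState(board="(empty board)", valid_moves=[0, 1, 2])

    def step(self, move: int) -> GameState:
        self.turns_left -= 1
        done = self.turns_left == 0
        return GameState(board=f"(board after move {move})",
                         valid_moves=[0, 1, 2],
                         done=done,
                         llm_won=done and random.random() < 0.5)

def evaluate_winrate(game: TwoPlayerGame,
                     query_model: Callable[[str], int],
                     num_episodes: int = 20) -> float:
    """Prompt the model each turn with the game description, board state,
    and valid moves; return the fraction of episodes the model wins."""
    wins = 0
    for _ in range(num_episodes):
        state = game.reset()
        while not state.done:
            prompt = (f"{game.description}\n\n"
                      f"Board:\n{state.board}\n"
                      f"Valid moves: {state.valid_moves}\n"
                      f"Choose one move.")
            move = query_model(prompt)
            if move not in state.valid_moves:  # fall back on invalid output
                move = random.choice(state.valid_moves)
            state = game.step(move)
        wins += int(state.llm_won)
    return wins / num_episodes

if __name__ == "__main__":
    # Random "model" baseline; a real run would call an LLM API here.
    game = TwoPlayerGame(description="Example generated game description.")
    print(evaluate_winrate(game, query_model=lambda prompt: random.choice([0, 1, 2])))
```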
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18624