Keywords: language models, benchmarks, reasoning, environments, games
TL;DR: We present a pipeline for generating games that can be used to evaluate LLMs
Abstract: We present gg-bench, a collection of generated game environments designed to evaluate the reasoning capabilities of language models. gg-bench is synthetically generated by (1) using an LLM to write game descriptions in natural language, (2) using the same LLM to implement each game in code, and (3) training RL agents via self-play on the generated games. We evaluate models based on their win rate against these RL agents by prompting them with the game description, current board state, and a list of valid moves, after which models output the moves they wish to take. gg-bench is challenging: general-purpose LLMs (GPT-4o, Claude 3.7 Sonnet) achieve win rates of 7–9% on gg-bench using in-context learning, while reasoning models (o1, o3-mini, DeepSeek-R1) achieve average win rates of 31–36%. Additionally, because gg-bench is a data generating process rather than a static benchmark, new evaluation instances can be created at will. We release the generated games, data generation process, and evaluation code to support future modeling work and expansion of our benchmark.
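The abstract describes the evaluation protocol at a high level: each turn, the model is prompted with the game description, the current board state, and the list of valid moves, and its chosen move is played against a self-play RL agent. The sketch below is an illustrative rendering of that loop, not the released evaluation code; the `TwoPlayerGame` interface, `GameState` fields, and `query_model` callable are all assumptions for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical minimal interface for a generated game environment.
# Field and class names here are illustrative, not gg-bench's actual API.
@dataclass
class GameState:
    board: str              # textual rendering of the current board
    valid_moves: List[int]  # moves the model may legally choose
    done: bool = False
    llm_won: bool = False

class TwoPlayerGame:
    """Toy stand-in for a generated game; a real environment would also
    apply the RL opponent's response inside step()."""
    def __init__(self, description: str):
        self.description = description
        self.turns_left = 10

    def reset(self) -> GameState:
        self.turns_left = 10
        return GameState(board="(empty board)", valid_moves=[0, 1, 2])

    def step(self, move: int) -> GameState:
        self.turns_left -= 1
        done = self.turns_left == 0
        return GameState(board=f"(board after move {move})",
                         valid_moves=[0, 1, 2],
                         done=done,
                         llm_won=done and random.random() < 0.5)

def evaluate_winrate(game: TwoPlayerGame,
                     query_model: Callable[[str], int],
                     num_episodes: int = 20) -> float:
    """Prompt the model each turn with the game description, board state,
    and valid moves; return the fraction of episodes the model wins."""
    wins = 0
    for _ in range(num_episodes):
        state = game.reset()
        while not state.done:
            prompt = (f"{game.description}\n\n"
                      f"Board:\n{state.board}\n"
                      f"Valid moves: {state.valid_moves}\n"
                      f"Choose one move.")
            move = query_model(prompt)
            if move not in state.valid_moves:  # fall back on invalid output
                move = random.choice(state.valid_moves)
            state = game.step(move)
        wins += int(state.llm_won)
    return wins / num_episodes

if __name__ == "__main__":
    # Random "model" baseline; a real run would call an LLM API here.
    game = TwoPlayerGame(description="Example generated game description.")
    print(evaluate_winrate(game, query_model=lambda prompt: random.choice([0, 1, 2])))
```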
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18624