GameDevBench: Evaluating Agentic Capabilities Through Game Development
Keywords: Games, Agents, Benchmark, Evaluation, LLM, LM Agent
TL;DR: GameDevBench is the first benchmark for evaluating an LM agent's ability to develop games using a game engine.
Abstract: Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind.
A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding.
In game development, agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene.
We present GameDevBench, the first benchmark for evaluating agents on game development tasks.
GameDevBench consists of 358 tasks derived from web and video tutorials.
Tasks require significant multimodal understanding and are complex: the average solution requires over three times as many lines of code and file changes as solutions in prior software development benchmarks.
Agents struggle with game development, with the best baseline agent solving only $49.0\%$ of tasks.
We find a strong correlation between perceived task difficulty and multimodal complexity, with the average success rate dropping from $56.1\%$ on gameplay-oriented tasks to $37.0\%$ on 2D graphics tasks.
To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents.
Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5’s performance from $34.4\%$ to $44.7\%$ when given video feedback.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 229