KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Published: 18 Sept 2025 · Last Modified: 14 Dec 2025 · NeurIPS 2025 spotlight · CC BY 4.0
Keywords: LLM; Evaluation; RL; Game
TL;DR: We propose KORGym, a dynamic, game-based benchmark offering more than 50 interactive tasks with RL support for multi-turn LLM reasoning evaluation, and validate its effectiveness through extensive experiments, revealing several key insights.
Abstract: Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM’s general reasoning potential. To address this limitation, we introduce the **Knowledge Orthogonal Reasoning Gymnasium (KORGym)**, a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.
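The abstract describes interactive, multi-turn assessment inspired by Gymnasium. As a rough illustration only, the sketch below shows what a Gymnasium-style evaluation loop for such a platform could look like; the `env`/`agent` interface and all names are hypothetical assumptions, not the actual KORGym API.

```python
# Hypothetical sketch of a Gymnasium-style multi-turn evaluation loop.
# All interface names here are illustrative assumptions, not KORGym's API.

def evaluate_game(env, agent, max_turns: int = 20) -> float:
    """Run one interactive episode: the agent (an LLM wrapper) observes a
    textual or visual game state, replies with an action, and receives the
    next state and reward until the game ends or the turn limit is hit."""
    observation, info = env.reset()  # initial game state (text or image)
    total_reward = 0.0
    for _ in range(max_turns):
        action = agent.act(observation)  # LLM produces the next move
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```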
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 14317