Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning

ICLR 2026 Conference Submission 17123 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Vision Language Model, Reasoning, Data Synthesis, Game Playing, Visual Question Answering, Data Sets or Data Repositories, Benchmarks
Abstract: Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g., geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting the exploration and learning of Vision Language Models (VLMs) through RL. We find that video games inherently provide rich visual elements and mechanics that are easy to verify. To fully exploit the multimodal, verifiable rewards in video games, we propose Game-RL, which constructs diverse game tasks for RL training to boost VLMs' general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize game reasoning task data, yielding the GameQA dataset of 30 games and 158 tasks with controllable difficulty gradation. Unexpectedly, training solely on GameQA helps VLMs achieve better out-of-domain generalization, demonstrating the value of Game-RL for enhancing VLMs' general reasoning. Furthermore, this suggests that RL can lead to generalizable improvements in VLMs' reasoning abilities and, notably, that video games may serve as valuable scenarios and resources for eliciting this generalization.
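To make the Code2Logic idea concrete, here is a minimal sketch of how a game's own code can double as the ground-truth oracle for synthesized, automatically verifiable QA data. This is an illustrative toy (tic-tac-toe), not the paper's released pipeline; the function names, QA format, and game choice are all assumptions.

```python
import random

# Game logic: the same code that defines the game also verifies answers.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X', 'O', or None for a 9-cell tic-tac-toe board."""
    for a, b, c in WIN_LINES:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_game(rng):
    """Play random legal moves to a terminal state, so at most one side wins."""
    board, player = ['.'] * 9, 'X'
    while winner(board) is None and '.' in board:
        empty = [i for i, c in enumerate(board) if c == '.']
        board[rng.choice(empty)] = player
        player = 'O' if player == 'X' else 'X'
    return board

def synthesize_qa(rng):
    """Derive a QA pair whose answer is computed, not annotated, by the game code."""
    board = random_game(rng)
    question = ("Given the tic-tac-toe board below, which player has a "
                "completed line, if any?\n" +
                "\n".join("".join(board[i:i + 3]) for i in (0, 3, 6)))
    return {"question": question, "answer": winner(board) or "none"}

rng = random.Random(0)
sample = synthesize_qa(rng)
print(sample["question"])
print("ground truth:", sample["answer"])
```

Because the answer is produced by executing game logic rather than by human labeling, the same function can later serve as the verifiable reward signal during RL training; scaling the board, rules, or number of moves is one plausible way to obtain the controllable difficulty gradation the abstract describes.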
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17123