MMGR: Benchmarking Multi-Modal Generative Reasoning in World Models for Language-Grounded Agents
Keywords: Video Generation Evaluation
Abstract: Video foundation models have made striking progress in synthesizing visually compelling and temporally coherent content, yet their viability as \emph{world simulators} hinges on whether they internalize the physical, logical, and spatial constraints that govern reality. While traditional metrics largely emphasize perceptual fidelity, emerging research highlights a critical gap: current models often hallucinate violations of causal structure and lack the long-horizon spatial memory required for consistent planning. To address this gap, we propose a principled evaluation framework grounded in five core reasoning abilities: \textbf{Physical}, \textbf{Logical}, \textbf{3D Spatial}, \textbf{2D Spatial}, and \textbf{Temporal} reasoning. To distinguish genuine reasoning from statistical mimicry, we construct \textbf{MMGR} (\underline{M}ulti-\underline{M}odal \underline{G}enerative \underline{R}easoning Evaluation and Benchmark). Crucially, MMGR adopts a dataset construction paradigm centered on answer-verifiable tasks: correct visual outputs are strictly determined by underlying logical rules, which ensures high reproducibility and eliminates the ambiguity of subjective metrics. This construction enforces a process-oriented evaluation standard, demanding that models demonstrate chain-of-frame reasoning, in which intermediate frames serve as necessary causal links in a deductive chain. We benchmark state-of-the-art video generation models (including \textbf{Veo-3}, \textbf{Sora-2}, and \textbf{Wan-2.2}) alongside leading image generation models such as \textbf{Nano-banana}, \textbf{Nano-banana Pro}, \textbf{GPT-4o-image}, and \textbf{Qwen-image}, revealing a pronounced performance asymmetry across modalities. While current models achieve moderate success on Physical Commonsense tasks, they fail catastrophically on Abstract Reasoning (under $10\%$ accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Through detailed quantitative analysis and human evaluation, we identify key limitations of existing training paradigms: a severe data imbalance favoring perception over symbolic reasoning, architectural weaknesses in maintaining global state consistency, and optimization objectives that reward visual plausibility over causal correctness. By unifying abstract logic, embodied interaction, and intuitive physics under a single evaluation framework, MMGR provides a diagnostic lens on the reasoning deficits of modern generative models and outlines a concrete roadmap toward \textbf{physically grounded, logically consistent, and reasoning-aware world models}.
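To make the answer-verifiable paradigm concrete, here is a minimal Python sketch of how such a protocol can be scored: each task carries a deterministic ground-truth rule, and a generation counts as correct only if its parsed symbolic content exactly matches the rule-determined answer. The task representation and names (`VerifiableTask`, `generate_and_parse`, the grid encoding) are illustrative assumptions for exposition, not MMGR's actual interface.

```python
# Illustrative sketch of answer-verifiable scoring (NOT MMGR's real API):
# a task is correct iff the symbolic content of the generated output
# matches the output strictly determined by the task's underlying rule.
from dataclasses import dataclass
from typing import Callable, List

Grid = List[List[int]]  # ARC-style discrete grid; hypothetical encoding


@dataclass
class VerifiableTask:
    task_id: str
    input_grid: Grid
    rule: Callable[[Grid], Grid]  # deterministic ground-truth rule


def exact_match(pred: Grid, target: Grid) -> bool:
    """Binary, reproducible criterion: no subjective perceptual judgment."""
    return pred == target


def score(tasks: List[VerifiableTask],
          generate_and_parse: Callable[[Grid], Grid]) -> float:
    """Fraction of tasks whose generated output matches the rule-determined answer.

    `generate_and_parse` stands in for the full pipeline: render the task,
    run the generative model, and parse its final frame back into a grid.
    """
    correct = sum(
        exact_match(generate_and_parse(t.input_grid), t.rule(t.input_grid))
        for t in tasks
    )
    return correct / max(len(tasks), 1)


if __name__ == "__main__":
    # Toy check: a rule that mirrors the grid horizontally, and a "model"
    # that happens to implement it, yields accuracy 1.0.
    mirror = lambda g: [row[::-1] for row in g]
    tasks = [VerifiableTask("demo-0", [[1, 0], [0, 2]], mirror)]
    print(score(tasks, mirror))  # 1.0
```

Because correctness is determined by the rule rather than by a human rater or a learned judge, any two runs of the harness on the same generations produce identical scores.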
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 132