Evaluating Spatial World Modeling in Video Generators via 3D Camera Trajectory Generation

Published: 26 May 2026, Last Modified: 26 May 2026
Venue: ICML 2026 FoGen Workshop Poster
Readers: Everyone
License: CC BY 4.0
Keywords: World Model, Generative Planning, Video Diffusion Model
TL;DR: Our paper shows a clear gap between video models' spatial perception and spatial reasoning: they can learn local cues such as free space and obstacles, but they struggle to use them for room-level planning.
Abstract: Recent progress in world models raises a central question: can video generators, as candidate world models, reason about spatial structure rather than only produce plausible motion? Existing evaluations often miss key spatial functions or test them only in simplified settings such as mazes, grids, and toy motion patterns. To address this gap, we introduce 3D Camera Trajectory Generation, a floor-plan-conditioned task in which a model must generate both a plan-style video and a camera pose sequence under indoor structural constraints. We build two datasets for this task: a single-target dataset, where each path visits only one item, and a more realistic multi-target dataset for long, tour-like behavior. We also introduce a score engine that measures trajectory quality for diagnostic evaluation. Our analysis shows that video generators learn visually regular spatial cues, especially local free-space perception. However, this ability does not reliably compose into room-level planning. Models still struggle with doorway traversal, correct-room selection, target grounding, and target-facing orientation, which suggests partial spatial world-model behavior without a reliable topological abstraction of space.
Submission Number: 138