Grounding Generated Videos in Feasible Plans via World Models

Published: 02 Mar 2026, Last Modified: 17 Apr 2026
Venue: ICLR 2026 Workshop on World Models
License: CC BY 4.0
Keywords: Video Generative Models, World Models
TL;DR: GVP-WM projects video-generated plans onto dynamically feasible latent trajectories under a learned world model, while preserving semantic alignment with the video guidance.
Abstract: Large-scale video generative models have shown emergent capabilities as zero-shot visual planners, yet video-generated plans often violate temporal consistency and physical constraints, leading to failures when mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP-WM), a planning method that grounds video-generated plans into feasible action sequences using a pre-trained action-conditioned world model. At test time, GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video-guided latent collocation. In particular, we formulate grounding as a goal-conditioned latent-space trajectory optimization problem that jointly optimizes latent states and actions under the world-model dynamics while preserving semantic alignment with the video-generated plan. Empirically, GVP-WM recovers feasible long-horizon plans from zero-shot image-to-video generations and from motion-blurred videos that violate physical constraints, across simulated navigation and manipulation tasks.
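The abstract's core idea of video-guided latent collocation can be sketched as a small optimization problem. The sketch below is a minimal, hypothetical illustration, not the paper's implementation: the learned world model and video encoder are replaced by a fixed linear latent dynamics map, and the semantic-alignment term is approximated as a squared distance to the encoded video-plan frames. All names (`Wz`, `Wa`, `lam_dyn`, `lam_vid`, `video_lat`) are assumptions introduced here for illustration. Latent states and actions are optimized jointly by gradient descent on a dynamics-residual penalty plus a video-alignment penalty, with the initial and goal latents pinned as hard constraints.

```python
import numpy as np

# Hypothetical toy stand-ins: in GVP-WM the dynamics f(z, a) and the video
# encoder are learned networks; here a fixed linear map plays that role.
rng = np.random.default_rng(0)
D, A, T = 4, 2, 10                       # latent dim, action dim, horizon
Wz = np.eye(D) * 0.9                     # toy latent dynamics: z' = Wz z + Wa a
Wa = rng.normal(size=(D, A)) * 0.1

video_lat = rng.normal(size=(T + 1, D))  # encoded video-plan frames (stand-in)
z0, zg = video_lat[0].copy(), video_lat[-1].copy()  # initial / goal latents

z = video_lat.copy()                     # decision variables: latent states...
a = np.zeros((T, A))                     # ...and actions, optimized jointly

# Dynamics violation of the raw video plan (actions zeroed), before grounding.
init_err = float(np.linalg.norm(video_lat[1:] - video_lat[:-1] @ Wz.T))

lam_dyn, lam_vid, lr = 5.0, 1.0, 0.02    # penalty weights and step size
for _ in range(500):
    pred = z[:-1] @ Wz.T + a @ Wa.T      # f(z_t, a_t) for all t
    r = z[1:] - pred                     # dynamics residuals
    # Gradients of lam_dyn * sum ||r_t||^2 + lam_vid * ||z - video_lat||^2
    gz = 2 * lam_vid * (z - video_lat)   # video-alignment term
    gz[1:] += 2 * lam_dyn * r            # residual wrt z_{t+1}
    gz[:-1] -= 2 * lam_dyn * (r @ Wz)    # residual wrt z_t
    ga = -2 * lam_dyn * (r @ Wa)         # residual wrt a_t
    z -= lr * gz
    a -= lr * ga
    z[0], z[-1] = z0, zg                 # goal-conditioning: pin endpoints

dyn_err = float(np.linalg.norm(z[1:] - (z[:-1] @ Wz.T + a @ Wa.T)))
print(init_err, dyn_err)                 # grounding shrinks dynamics violation
```

In the actual method the gradient steps would backpropagate through the learned world model instead of the closed-form linear gradients used here; the structure of the objective (dynamics feasibility plus video alignment, with goal conditioning) is what the sketch illustrates.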
Submission Number: 44