Keywords: world model, robot planning, 4D generation
TL;DR: We propose a 4D latent world model that generates dynamic 3D structures to benefit robot planning.
Abstract: Learned world models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, long-horizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences, inherently lacking the 3D geometric understanding necessary for precise spatial reasoning and physical consistency. Recent work has begun to inject 3D signals into video world models (e.g., depth and normals), improving spatial understanding but still operating on surface-level projections that can struggle under occlusion and viewpoint changes. To overcome this limitation, we introduce the *4D Latent World Model*, which learns to predict the evolution of a scene's 3D structure within a structured sparse voxel latent space, conditioned on observations and textual instructions. The latent space encodes the scene holistically and can be decoded into diverse 3D formats (e.g., 3D Gaussian Splatting), enabling a more complete and physically consistent scene understanding. This 4D latent world model serves as a planner, generating future scenes that are translated into executable actions by a goal-conditioned inverse dynamics model. Experiments demonstrate that our model generates futures with superior visual quality, physical consistency, and multi-view coherence compared to state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, exhibits robust generalization to novel visual conditions, and proves effective on real-world robotic platforms.
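To make the pipeline described above concrete, here is a minimal sketch (not the authors' released code) of the planning loop: a 4D latent world model predicts the next sparse-voxel scene latent from the current observation and a text instruction, and a goal-conditioned inverse dynamics model translates the imagined future into an executable action. All class names, method names, and dimensions below are hypothetical placeholders chosen for illustration.

```python
import torch
import torch.nn as nn


class LatentWorldModel(nn.Module):
    """Hypothetical 4D latent world model: predicts the next scene latent
    conditioned on the current latent and an instruction embedding."""

    def __init__(self, latent_dim: int = 256, text_dim: int = 512):
        super().__init__()
        self.dynamics = nn.GRUCell(latent_dim + text_dim, latent_dim)

    def forward(self, latent: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Roll the scene latent forward one step given the instruction.
        return self.dynamics(torch.cat([latent, text_emb], dim=-1), latent)


class InverseDynamicsModel(nn.Module):
    """Hypothetical goal-conditioned inverse dynamics model: maps the current
    and predicted (goal) latents to a robot action."""

    def __init__(self, latent_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(2 * latent_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, current: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.policy(torch.cat([current, goal], dim=-1))


def plan_step(world_model, inv_dyn, encode_obs, encode_text, obs, instruction):
    """One planning step: imagine the next scene latent, then infer the action.

    `encode_obs` and `encode_text` stand in for the observation and text
    encoders, which are assumptions of this sketch.
    """
    z_t = encode_obs(obs)          # encode observation into the sparse-voxel latent space
    e = encode_text(instruction)   # embed the textual instruction
    z_next = world_model(z_t, e)   # predict the future 4D latent (the "imagined" goal)
    action = inv_dyn(z_t, z_next)  # goal-conditioned inverse dynamics -> executable action
    return action, z_next
```

In this reading, planning amounts to imagining a future latent and treating it as the goal for the inverse dynamics model; decoding the latent into explicit 3D formats such as 3D Gaussian Splatting would be a separate decoder head not shown here.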
Primary Area: applications to robotics, autonomy, planning
Submission Number: 4553