Keywords: generative model, world model, autonomous driving
Abstract: Generative world models excel at synthesizing plausible visual sequences but still fall short in capturing the continuous 4D structure of real environments. We introduce UNICST, a unified 4D latent world model that jointly learns Continuous Spatio-Temporal representations with minimal inductive bias, enabling seamless, spatio-temporally coherent video generation. Built on a next-scale latent prediction paradigm, UNICST constructs its 4D latent hierarchy in a coarse-to-fine fashion, achieving near real-time speeds. This makes it ideally suited for controllable 4D generation and downstream embodied tasks. Extensive experiments on large-scale driving datasets demonstrate that UNICST outperforms state-of-the-art methods in both visual fidelity and inference latency, establishing a new baseline for practical world modeling in autonomous systems.
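The coarse-to-fine next-scale prediction described above can be illustrated with a minimal sketch. This is not the UNICST implementation (the abstract does not specify scales, latent shapes, or the predictor); the scale schedule, `upsample` helper, and random stand-in for the learned predictor are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample(latent, size):
    # Nearest-neighbor upsample of an (H, W, C) latent map to (size, size, C).
    idx = (np.arange(size) * latent.shape[0] / size).astype(int)
    return latent[idx][:, idx]

def next_scale_generate(scales=(1, 2, 4, 8), channels=4):
    # Coarse-to-fine generation: each finer latent map is predicted
    # conditioned on the upsampled coarser one. A random residual stands
    # in for the learned next-scale predictor (hypothetical).
    latent = rng.standard_normal((scales[0], scales[0], channels))
    for s in scales[1:]:
        context = upsample(latent, s)                        # condition on coarser scale
        residual = rng.standard_normal(context.shape) * 0.1  # placeholder prediction
        latent = context + residual
    return latent

out = next_scale_generate()
print(out.shape)  # (8, 8, 4)
```

Because each scale attends only to an already-generated coarser map rather than to a long autoregressive token sequence, this style of hierarchy is what enables the near real-time inference claimed above.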
Supplementary Material: zip
Submission Number: 11