Abstract: Recent advances in visual generative models have substantially broadened the scope of scene synthesis across modalities such as video, 3D, and 4D, significantly expanding their applications across domains. Despite this progress, most existing systems treat scenes in isolation, lacking long-range spatial-temporal coherence and interactive control mechanisms. These shortcomings limit interactivity and composability, constraining such systems in scenarios such as immersive entertainment and education. To address this, we introduce DreamGen, a unified framework that transforms a single panoramic image into a fully interactive, panoramic 4D world. DreamGen operates through an integrated three-stage pipeline: first, it performs view-consistent 3D reconstruction via Gaussian Splatting, employing monocular depth estimation and diffusion-based inpainting to enrich and complete the scene; next, it simulates continuous camera trajectories to ensure geometric and temporal consistency; finally, it combines these outputs in a real-time, event-driven SuperSplat renderer that supports dynamic editing and immersive exploration. Extensive experiments on the comprehensive WorldScore benchmark demonstrate that DreamGen outperforms existing state-of-the-art methods in controllability, visual fidelity, and motion dynamics. Our approach not only sets a new standard for interactive and coherent 4D world generation but also opens promising avenues for applications in immersive entertainment, embodied AI, and advanced simulation.
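To make the three-stage pipeline in the abstract concrete, here is a minimal, hypothetical sketch of its control flow. The abstract does not specify an API, so every name here (estimate_depth, inpaint, fit_gaussians, simulate_trajectory, SupersplatRenderer, dreamgen) is an illustrative assumption, and each stage is stubbed with placeholder logic so the skeleton runs end to end; it is not the paper's actual implementation.

```python
# Hypothetical sketch of DreamGen's three-stage pipeline; all names and
# stub bodies are assumptions made for illustration only.
import numpy as np


def estimate_depth(panorama: np.ndarray) -> np.ndarray:
    # Stand-in for a monocular depth estimator (Stage 1 input).
    return np.ones(panorama.shape[:2], dtype=np.float32)


def inpaint(panorama: np.ndarray, depth: np.ndarray) -> np.ndarray:
    # Stand-in for diffusion-based inpainting that completes occluded regions.
    return panorama


def fit_gaussians(rgb: np.ndarray, depth: np.ndarray) -> dict:
    # Stage 1: lift pixels to 3D points and fit a Gaussian Splatting scene.
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    means = np.stack([xs, ys, depth], axis=-1).reshape(-1, 3)
    return {"means": means, "colors": rgb.reshape(-1, rgb.shape[-1])}


def simulate_trajectory(scene: dict, num_frames: int = 120) -> list:
    # Stage 2: sample a continuous camera path; neighboring poses differ
    # only slightly, which underpins geometric/temporal consistency.
    return [{"frame": t, "pose": np.eye(4)} for t in range(num_frames)]


class SupersplatRenderer:
    # Stage 3 stand-in: an event-driven renderer for editing/exploration.
    def __init__(self, scene: dict):
        self.scene = scene
        self.handlers = {}

    def on(self, event: str, handler) -> None:
        self.handlers[event] = handler  # e.g., edit or navigation events

    def play(self, frames: list) -> None:
        for frame in frames:
            pass  # a real renderer would rasterize splats at frame["pose"]


def dreamgen(panorama: np.ndarray) -> None:
    depth = estimate_depth(panorama)
    completed = inpaint(panorama, depth)
    scene = fit_gaussians(completed, depth)   # Stage 1: 3D reconstruction
    frames = simulate_trajectory(scene)       # Stage 2: camera trajectory
    viewer = SupersplatRenderer(scene)        # Stage 3: interactive viewing
    viewer.on("edit", lambda event: None)     # hypothetical edit hook
    viewer.play(frames)


dreamgen(np.zeros((64, 128, 3), dtype=np.float32))
```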
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Text/Image-to-4D Generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5255