Image as a World: Generating Interactive World from Single Image via Panoramic Video Generation

Published: 31 Dec 2025 · Last Modified: 03 Oct 2025 · NeurIPS 2025 · CC BY 4.0
Abstract: Generating an interactive visual world from a single image is both challenging and practically valuable, as single-view inputs are easy to acquire and align well with prompt-driven applications such as gaming and virtual reality. This paper introduces a novel unified framework, Image as a World (IaaW), which synthesizes from a single image high-quality 360-degree videos that are both controllable and temporally extendable. Our framework consists of three stages: world initialization, which jointly synthesizes spatially complete and temporally dynamic scenes from a single view; world exploration, which supports user-specified viewpoint rotation; and world continuation, which extends the generated scene forward in time with temporal consistency. To support this pipeline, we design a visual world model based on generative diffusion models, modulated with spherical 3D positional encoding and multi-view composition to represent geometry and view semantics. Additionally, a vision-language model (IaaW-VLM) is fine-tuned to produce both global and view-specific prompts, improving semantic alignment and controllability. Extensive experiments demonstrate that our method produces panoramic videos with superior visual quality, minimal distortion, and seamless continuation in both qualitative and quantitative evaluations. To the best of our knowledge, this is the first work to generate a controllable, consistent, and temporally expandable 360-degree world from a single image.
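The abstract does not spell out how the spherical 3D positional encoding is constructed; a minimal sketch of one plausible construction, assuming equirectangular panoramic frames and NeRF-style sinusoidal features of the per-pixel viewing direction (the function names and the number of frequency bands are illustrative assumptions, not the paper's specification):

```python
import numpy as np

def spherical_positions(h, w):
    """Map an equirectangular pixel grid to unit-sphere viewing directions.

    Assumption: longitude spans [-pi, pi) across the width and latitude
    spans [pi/2, -pi/2] down the height, with samples at pixel centers.
    """
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)                 # each (h, w)
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)              # (h, w, 3), unit norm

def spherical_encoding(dirs, num_freqs=4):
    """Sinusoidal features of the 3D direction (hypothetical band count)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi    # (F,)
    ang = dirs[..., None] * freqs                    # (h, w, 3, F)
    feats = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return feats.reshape(*dirs.shape[:-1], -1)       # (h, w, 3 * 2 * F)
```

Encoding positions on the sphere rather than on the flat image plane is what would let such features remain consistent across the 360-degree wrap-around, which matches the paper's stated goal of minimal distortion at panorama seams.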