Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Models
Keywords: 4D generation
TL;DR: We introduce Zero4D, a novel approach that generates synchronized multi-view videos from a single video using an off-the-shelf video diffusion model, without any training.
Abstract: Multi-view and 4D video generation have recently emerged as important topics in generative modeling. However, existing approaches face key limitations: they often require orchestrating multiple video diffusion models with additional training, or involve computationally intensive training of full 4D diffusion models, despite the limited availability of real-world 4D datasets. In this work, we propose the first training-free 4D video generation method that leverages off-the-shelf video diffusion models to synthesize multi-view videos from a single input video. Our approach consists of two stages. First, we designate the edge frames of a spatio-temporal sampling grid as key frames and synthesize them with a video diffusion model, guided by depth-based warping to preserve structural and temporal consistency. Second, we interpolate the remaining frames to complete the spatio-temporal grid, again using a video diffusion model to maintain coherence. This two-stage framework extends a single-view video into a multi-view 4D representation along novel camera trajectories while maintaining spatio-temporal fidelity. Our method is entirely training-free, requires no access to multi-view data, and fully utilizes existing generative video models, offering a practical and effective solution for 4D video generation.
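The sketch below illustrates the two-stage grid completion described in the abstract: the input video fills one row of a (views x time) grid, key frames on the grid edges are synthesized first, and the remaining entries are then interpolated. The function names, samplers, and grid layout are illustrative assumptions, not the paper's actual implementation; the real method uses a video diffusion model with depth-based warping guidance in place of the dummy callables.

```python
# Minimal, hypothetical sketch of two-stage spatio-temporal grid completion.
# The samplers stand in for calls to an off-the-shelf video diffusion model
# (with depth-warp guidance for key frames); names are illustrative only.
import numpy as np

def complete_spatiotemporal_grid(input_video, num_views, keyframe_sampler, interp_sampler):
    """Fill a (num_views x num_frames) grid of frames from a single-view video.

    input_video:      list of frames (H x W x 3 arrays) from the original camera.
    keyframe_sampler: callable(frames) -> frames; synthesizes edge frames of the
                      grid, e.g. a video diffusion model guided by depth warping.
    interp_sampler:   callable(frames) -> frames; fills the missing (None) frames
                      of a row, e.g. the same diffusion model used as interpolator.
    """
    T = len(input_video)
    grid = [[None] * T for _ in range(num_views)]
    grid[0] = list(input_video)                      # row 0 = the given video

    # Stage 1: key frames on the grid edges.
    # First column: novel views of frame 0 (a camera sweep at t = 0).
    first_column = keyframe_sampler([input_video[0]] * num_views)
    for v in range(num_views):
        grid[v][0] = first_column[v]
    # Last row: the final novel view followed across time.
    grid[num_views - 1] = keyframe_sampler(list(input_video))

    # Stage 2: interpolate the remaining interior rows between known edge frames.
    for v in range(1, num_views - 1):
        grid[v] = interp_sampler(grid[v])            # fills the None entries

    return grid

# Usage with dummy samplers (identity / nearest-frame fill) just to run the scaffold:
if __name__ == "__main__":
    video = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(8)]
    copy = lambda frames: [f.copy() for f in frames]
    fill = lambda frames: [f if f is not None else frames[0].copy() for f in frames]
    out = complete_spatiotemporal_grid(video, num_views=4,
                                       keyframe_sampler=copy, interp_sampler=fill)
    print(len(out), len(out[0]))                     # 4 views x 8 frames
```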
Primary Area: generative models
Submission Number: 9038