VideoDiT: Bridging Image Diffusion Transformers for Streamlined Video Generation

ICLR 2025 Conference Submission 325 Authors

13 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Text-to-Video Generation, Diffusion Models, Image Diffusion Transformer
TL;DR: We introduce VideoDiT, a framework that integrates a Distribution-Preserving VAE and a 3D Diffusion Transformer into pre-trained T2I models, enabling efficient joint image-video training and high-quality synthesis with minimal additional parameters.
Abstract: We present VideoDiT, a streamlined video generation framework adapted from pre-trained image generation models. Unlike previous methods that simply add temporal layers to image diffusion models, we enhance both the tokenizer, implemented as a variational autoencoder (VAE), and the diffusion model. We emphasize the importance of combining 3D VAE compression with knowledge from pre-trained image diffusion models to achieve efficient video generation; however, the tight coupling between image diffusion models and 2D VAEs poses significant challenges. To address this, we introduce the Distribution-Preserving VAE (DP-VAE), which encodes the key frames of a video clip with the original 2D VAE while compressing non-key frames with a 3D VAE for spatiotemporal modeling. A regularization term aligns the 3D video latent space with the 2D image latent space, enabling seamless transfer of pre-trained diffusion models. Building on the Diffusion Transformer (DiT) architecture and incorporating 3D positional embeddings, we extend 2D attention to 3D with a negligible increase in parameters. Furthermore, the proposed DP-VAE allows VideoDiT to support joint image-video training, preserving the spatial modeling capability of the base model while excelling at both image and video generation. Extensive experiments validate the effectiveness of our approach.
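The two mechanisms in the abstract can be made concrete with short sketches. First, a minimal PyTorch sketch of the DP-VAE encoding path: the wrapper class, the key-frame split, and the moment-matching form of the alignment regularizer are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the DP-VAE encoding scheme described above (assumed form).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPVAEEncoder(nn.Module):
    def __init__(self, vae2d: nn.Module, vae3d: nn.Module, align_weight: float = 0.1):
        super().__init__()
        self.vae2d = vae2d              # frozen pre-trained 2D image VAE encoder
        self.vae3d = vae3d              # trainable 3D encoder for non-key frames
        self.align_weight = align_weight

    def forward(self, video: torch.Tensor):
        # video: (B, C, T, H, W); treat frame 0 as the key frame (assumption).
        key, rest = video[:, :, 0], video[:, :, 1:]
        with torch.no_grad():
            z_key = self.vae2d(key)     # stays in the original image latent space
        z_rest = self.vae3d(rest)       # spatiotemporally compressed 3D latent
        # Distribution-preserving regularizer (assumed form): match the first
        # two moments of the 3D latents to the 2D key-frame latents, so a
        # pre-trained image DiT can operate on video latents directly.
        align = F.mse_loss(z_rest.mean(dim=(0, 2)), z_key.mean(dim=0)) \
              + F.mse_loss(z_rest.std(dim=(0, 2)), z_key.std(dim=0))
        z = torch.cat([z_key.unsqueeze(2), z_rest], dim=2)  # (B, C', T', H', W')
        return z, self.align_weight * align
```

Second, a sketch of extending the pre-trained 2D attention to 3D: tokens from all frames are flattened into one sequence and a factorized (t, h, w) sin-cos positional embedding is added. The factorized embedding and the even per-axis split are assumptions; the attention weights themselves are reused from the image DiT, which is what keeps the parameter increase negligible.

```python
# Sketch of inflating a pre-trained 2D attention block to 3D (assumed details).
import torch
import torch.nn as nn

def sincos_embed(n: int, dim: int) -> torch.Tensor:
    """Standard 1D sin-cos positional embedding of shape (n, dim); dim must be even."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    freq = torch.exp(-torch.arange(0, dim, 2, dtype=torch.float32)
                     * torch.log(torch.tensor(1e4)) / dim)
    emb = torch.zeros(n, dim)
    emb[:, 0::2], emb[:, 1::2] = torch.sin(pos * freq), torch.cos(pos * freq)
    return emb

class Inflated3DAttention(nn.Module):
    def __init__(self, attn2d: nn.MultiheadAttention, dim: int, T: int, H: int, W: int):
        super().__init__()
        assert dim % 6 == 0, "even split across the t/h/w axes (assumption)"
        self.attn = attn2d  # pre-trained 2D attention, built with batch_first=True
        d = dim // 3
        # Factorized 3D positional embedding: concatenate per-axis 1D embeddings.
        pt = sincos_embed(T, d)[:, None, None, :].expand(T, H, W, d)
        ph = sincos_embed(H, d)[None, :, None, :].expand(T, H, W, d)
        pw = sincos_embed(W, d)[None, None, :, :].expand(T, H, W, d)
        self.register_buffer("pos3d", torch.cat([pt, ph, pw], -1).reshape(T * H * W, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T*H*W, dim) — tokens from all frames in one sequence, so the
        # reused 2D attention now attends jointly over space and time.
        x = x + self.pos3d
        out, _ = self.attn(x, x, x, need_weights=False)
        return out
```

Because the inflated block degenerates to the original 2D attention when T = 1, the same network can consume single images and video clips alike, which is what makes the joint image-video training described in the abstract possible.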
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 325