LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

Text-to-Video Generation


Long Video Generation


Personalized T2V Generation


Training image samples

Vimeo25M samples


Comparison (Joint image-video fine-tuning)


Comparison (RoPE)


Abstract

We present LaVie, a text-to-video (T2V) generation system built on a cascade of video latent diffusion models. Comprising three networks, namely a base T2V model, a temporal interpolation model, and a video super-resolution model, LaVie addresses the research question of how to extend a pre-trained text-to-image (T2I) model to the realm of video synthesis. Our objective is to synthesize visually realistic and temporally coherent videos while preserving the strong compositional ability of the pre-trained T2I model. We find that incorporating simple temporal self-attention, coupled with relative positional encoding (RoPE), adequately captures the temporal correlations inherent in video data. We further validate that joint image-video fine-tuning plays a pivotal role in producing high-quality results. To enhance the performance of LaVie, we curate Vimeo25M, a comprehensive and diverse dataset of 25 million text-video pairs that prioritizes quality, diversity, and aesthetic appeal. Experiments demonstrate that LaVie outperforms the state of the art in both quantitative and qualitative evaluations. Finally, we showcase the versatility of pre-trained LaVie models in long video generation and personalized video synthesis applications.
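The key temporal ingredient described above, self-attention across frames combined with rotary relative positional encoding (RoPE), can be sketched as follows. This is a minimal single-head NumPy illustration under our own assumptions (function names, shapes, and the single-location view are all hypothetical), not the paper's actual implementation:

```python
import numpy as np

def rotary_embedding(x, base=10000.0):
    """Apply rotary position embedding (RoPE) along the time axis.
    x: (T, D) with D even. Channel pairs are rotated by a position-
    dependent angle, so attention scores depend on relative frame offsets."""
    T, D = x.shape
    half = D // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.arange(T)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def temporal_self_attention(frames, Wq, Wk, Wv):
    """Single-head self-attention over the time axis only.
    frames: (T, D) features of one spatial location across T frames."""
    q = rotary_embedding(frames @ Wq)
    k = rotary_embedding(frames @ Wk)
    v = frames @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (T, T) frame-to-frame
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over frames
    return attn @ v

rng = np.random.default_rng(0)
T, D = 8, 16                                         # 8 frames, 16 channels
frames = rng.standard_normal((T, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out = temporal_self_attention(frames, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```

In a T2I-to-T2V extension of this kind, such a layer would be inserted after each spatial attention block and run independently at every spatial location, leaving the pre-trained spatial weights untouched.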