LAVITA: Latent Video Diffusion Models with Spatio-temporal Transformers

Abstract

Video generation is a challenging task because it requires effective modeling of rich spatio-temporal information from high-dimensional video data. To tackle this challenge, we propose a novel architecture, the LAtent VIdeo diffusion model with spatio-temporal TrAnsformers, referred to as LAVITA, which integrates the Transformer architecture into diffusion models for the first time within the realm of video generation. Conceptually, LAVITA models spatial and temporal information separately, both to accommodate their inherent disparities and to reduce computational complexity. Following this design strategy, we design several Transformer-based model variants that integrate spatial and temporal information harmoniously. Moreover, we identify the best practices in architectural choices and learning strategies for LAVITA through rigorous empirical analysis. Our comprehensive evaluation demonstrates that LAVITA achieves state-of-the-art performance across several standard video generation benchmarks, including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, outperforming the current best models. We strongly believe that LAVITA provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.
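To make the complexity argument concrete, the sketch below counts query-key pairs for joint attention over all video tokens versus attention that is factorized into a spatial step and a temporal step. It is only an illustration under stated assumptions: the abstract does not specify the attention layout, so the grouping used here (spatial attention within each frame, temporal attention across frames at a fixed spatial position) is assumed, and the function name and the example sizes (16 frames, 256 tokens per frame) are hypothetical.

```python
def attention_pairs(num_frames: int, tokens_per_frame: int) -> dict:
    """Count query-key pairs for joint vs. factorized spatio-temporal attention.

    Assumption (not taken from the abstract): spatial attention runs
    independently inside each frame, and temporal attention runs
    independently at each spatial position across frames.
    """
    t, n = num_frames, tokens_per_frame
    joint = (t * n) ** 2     # every token attends to every token in the clip
    spatial = t * n ** 2     # t frames, each with n x n pairs
    temporal = n * t ** 2    # n positions, each with t x t pairs
    return {
        "joint": joint,
        "spatial": spatial,
        "temporal": temporal,
        "factorized": spatial + temporal,
    }


if __name__ == "__main__":
    # Hypothetical latent size: 16 frames, 256 tokens per frame.
    counts = attention_pairs(num_frames=16, tokens_per_frame=256)
    for name, value in counts.items():
        print(f"{name:>10}: {value:,}")
    print(f"reduction: {counts['joint'] / counts['factorized']:.1f}x")
```

Under these assumed sizes, the factorized layout needs roughly one fifteenth of the pairs that joint attention does, which is the kind of saving the separate spatial and temporal modeling is meant to deliver.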

Unconditional generation

Unconditional video generation on the Taichi-HD (256 x 256), FaceForensics (256 x 256) and SkyTimelapse (256 x 256) datasets.


Conditional generation based on classes

Given a class label, LAVITA generates videos of that class. Results are shown on the UCF101 (256 x 256) dataset.

Conditional generation based on prompts

Given a text prompt, LAVITA generates the desired video. Results are shown on the Webv2m dataset and a subset of LAION-5B (comprising approximately 6,400,000 images).

Decorate with pineapple sweet cake roll.

Reeds in the wind, razim lake, romania.

Slow pan upward of blazing oak fire in an indoor fireplace.

Flight over the country.

Sunset over the sea.

Comparison with state-of-the-art methods

Visual comparison with other state-of-the-art methods on the UCF101, Taichi-HD, FaceForensics, and SkyTimelapse datasets, respectively.

UCF101

PVDM

Ours

Taichi-HD

DIGAN

PVDM

Ours

FaceForensics

StyleGAN-V

PVDM

Ours

SkyTimelapse

StyleGAN-V

PVDM

Ours