Video generation is challenging because it requires effective modeling of rich spatio-temporal information in high-dimensional video data. To tackle this challenge, we propose a novel architecture, the LAtent VIdeo diffusion model with spatio-temporal TrAnsformers (LAVITA), which integrates the Transformer architecture into diffusion models for the first time in video generation. Conceptually, LAVITA models spatial and temporal information separately, both to accommodate their inherent disparities and to reduce computational complexity. Following this design strategy, we design several Transformer-based model variants that integrate spatial and temporal information. Moreover, we identify best practices in architectural choices and learning strategies for LAVITA through rigorous empirical analysis. Our comprehensive evaluation demonstrates that LAVITA achieves state-of-the-art performance across several standard video generation benchmarks, including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, outperforming the current best models. We believe that LAVITA provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.
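To make the complexity argument concrete, the sketch below counts query-key pairs for joint attention over all video tokens versus one spatial attention block followed by one temporal attention block. It is a back-of-the-envelope illustration, not the LAVITA implementation: the function names and the clip size (16 frames, 256 tokens per frame) are assumptions chosen for the example, and the count ignores attention heads, projections, and MLP layers.

```python
def attention_pairs_joint(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    n = num_frames * tokens_per_frame
    return n * n


def attention_pairs_factorized(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs for one spatial block plus one temporal block."""
    # Spatial block: tokens attend only within their own frame.
    spatial = num_frames * tokens_per_frame ** 2
    # Temporal block: each spatial location attends across frames.
    temporal = tokens_per_frame * num_frames ** 2
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical clip size, chosen only for illustration.
    T, S = 16, 256
    joint = attention_pairs_joint(T, S)
    factorized = attention_pairs_factorized(T, S)
    print(f"joint:      {joint:,} pairs")
    print(f"factorized: {factorized:,} pairs")
    print(f"ratio:      {joint / factorized:.1f}x")
```

Under these assumed sizes, the factorized pair pays for T·S² + S·T² query-key pairs instead of (T·S)², so the saving grows with both the number of frames and the number of tokens per frame.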
Unconditional video generation on the Taichi-HD (256 x 256), FaceForensics (256 x 256) and SkyTimelapse (256 x 256) datasets.
Given a class label, LAVITA generates videos of the desired class. Results are shown on the UCF101 (256 x 256) dataset.
LAVITA is used to generate the desired videos. Results are shown on the Webv2m dataset and a subset of Laion5B (comprising approximately 6,400,000 images).
Decorate with pineapple sweet cake roll.
Reeds in the wind, razim lake, romania.
Slow pan upward of blazing oak fire in an indoor fireplace.
Flight over the country.
Sunset over the sea.
Visual comparison with other state-of-the-art methods on the UCF101, Taichi-HD, FaceForensics, and SkyTimelapse datasets, respectively.
[Figure: comparison video panels, labeled in order: PVDM, Ours; DIGAN, PVDM, Ours; StyleGAN-V, PVDM, Ours; StyleGAN-V, PVDM, Ours.]