VideoGen: Generative Modeling of Videos using VQ-VAE and Transformers

Sep 28, 2020 (edited Mar 05, 2021) · ICLR 2021 Conference Blind Submission
  • Keywords: video generation, vqvae, transformers, gpt
  • Abstract: We present VideoGen: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGen uses a VQ-VAE that learns downsampled discrete latent representations of a video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite its simple formulation, ease of training, and light compute requirements, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate coherent action-conditioned samples based on experiences gathered from the VizDoom simulator. We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models without requiring industry-scale compute resources. Samples are available at
  • One-sentence Summary: Video generation model with latent space autoregressive transformer
  • Supplementary Material: zip
  • Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
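The abstract's encoder applies axial self-attention over the downsampled latent grid, i.e. attention restricted to one axis (time, height, or width) at a time, with the remaining axes treated as batch dimensions. A minimal single-head numpy sketch of this idea (illustrative only, with toy shapes; not the authors' implementation, which uses learned projections and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along `axis`.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x, axis):
    """Single-head self-attention along one axis of a latent grid.

    x: array of shape (T, H, W, C); `axis` in {0, 1, 2} selects the
    sequence axis; all other axes act as batch dimensions. Q/K/V
    projections are omitted for brevity (identity projections).
    """
    x = np.moveaxis(x, axis, -2)              # (..., L, C)
    scores = x @ np.swapaxes(x, -1, -2)       # (..., L, L) dot-product scores
    attn = softmax(scores / np.sqrt(x.shape[-1]))
    out = attn @ x                            # weighted sum of values
    return np.moveaxis(out, -2, axis)         # restore original layout

# Toy downsampled latent grid: 4 time steps, 8x8 spatial, 16 channels.
z = np.random.randn(4, 8, 8, 16)
for ax in range(3):                           # attend along T, then H, then W
    z = axial_attention(z, ax)
print(z.shape)  # (4, 8, 8, 16)
```

The appeal of the axial factorization is cost: full attention over the grid scales with (T·H·W)², while three axial passes scale with T², H², and W² per position, which is what makes attention over video latents tractable on modest compute.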