Building Blocks of Text-to-Video Generation

03 Feb 2023 (modified: 02 May 2023) · Submitted to Blogposts @ ICLR 2023 · Readers: Everyone
Keywords: text to video generation, generative models, imagen video, make a video, super resolution, autoencoders, attention
Abstract: Just six months after the release of DALL-E 2, both Meta and Google released novel Text-to-Video generation models that produce impressive video content. These networks build on recent advances in diffusion-based Text-to-Image models. Meta's Make-A-Video can generate five-second 768×768 clips at variable frame rates, while Google's Imagen Video can produce five-second 1280×768 videos at 24 fps. Rather than training strictly on text-video pair datasets, both Imagen Video and Make-A-Video leverage massive text-image pair databases to construct video from pretrained Text-to-Image generation models. These Text-to-Video generators can create high-resolution, photorealistic, and stylized content depicting impossible scenarios. Networks such as these can be powerful tools for artists and creators, and can also serve as the basis for predicting future frames of a video. Although these models demonstrate impressive video production capabilities, neither Imagen Video nor Make-A-Video has yet been released publicly for community use; access has been limited to select individuals because of potentially harmful content and biases present in the training data. This decision is a double-edged sword: withholding these complex model architectures hinders both the reproducibility and the interpretability of the results, and it widens the divide between research “winners and losers”, $\textit{i.e.}$, people who have access to these models and people who do not. Hence, in this blog, we dissect and explain the mechanics behind the key building blocks of state-of-the-art Text-to-Video generation, using publicly available information and papers, to improve the interpretability of these models until they are released for community use. We provide interactive and graphical examples of these building blocks to demonstrate the key novelties and differences between the two new state-of-the-art Text-to-Video models: Google's Imagen Video and Meta's Make-A-Video. Finally, we summarize by illustrating how these building blocks fit together into a complete Text-to-Video framework and by noting the current failure modes and limitations of these models. We aim for this blog to improve both the interpretability and reproducibility of Text-to-Video generation, and of generative models as a whole, by providing a general review of state-of-the-art papers in the field for the technical reader. A minimal, shape-level sketch of this cascaded generation pipeline is included below for illustration.
Blogpost Url: https://iclr-blogposts.github.io/staging/blog/2022/text2vid
ICLR Papers: https://openreview.net/forum?id=nJfylDvgzlq, https://arxiv.org/abs/2210.02303
ID Of The Authors Of The ICLR Paper: ~Uriel_Singer1
Conflict Of Interest: No
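
To give a concrete feel for the cascade described in the abstract (a frozen text encoder conditioning a low-resolution base video model, followed by temporal and spatial super-resolution stages), here is a minimal, shape-level Python sketch. All function names, resolutions, and frame counts are illustrative assumptions; neither Imagen Video nor Make-A-Video has publicly released code, and this is not their implementation.

```python
# Hypothetical, shape-level sketch of a cascaded Text-to-Video pipeline.
# Stage names, resolutions, and frame counts are illustrative assumptions,
# not the (unreleased) Imagen Video or Make-A-Video implementations.
import numpy as np


def encode_text(prompt: str, dim: int = 512) -> np.ndarray:
    """Stand-in for a frozen text encoder (e.g. T5 or CLIP); returns a deterministic random embedding."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2 ** 32))
    return rng.standard_normal(dim)


def base_video_model(text_emb: np.ndarray, frames: int = 16, size: int = 64) -> np.ndarray:
    """Stand-in for the low-resolution base video generator conditioned on the text embedding."""
    # In a real model the embedding conditions a spatio-temporal denoising network;
    # this placeholder just returns random frames of shape (T, C, H, W).
    _ = text_emb
    return np.random.rand(frames, 3, size, size)


def temporal_super_resolution(video: np.ndarray, factor: int = 2) -> np.ndarray:
    """Raise the frame rate; real systems interpolate new frames with a learned network."""
    return np.repeat(video, factor, axis=0)


def spatial_super_resolution(video: np.ndarray, factor: int = 4) -> np.ndarray:
    """Upsample each frame; real systems use diffusion-based super-resolution stages."""
    return video.repeat(factor, axis=2).repeat(factor, axis=3)


if __name__ == "__main__":
    emb = encode_text("a teddy bear painting a portrait")
    video = base_video_model(emb)              # (16, 3, 64, 64)
    video = temporal_super_resolution(video)   # (32, 3, 64, 64)
    video = spatial_super_resolution(video)    # (32, 3, 256, 256)
    print("final video tensor shape:", video.shape)
```

The sketch only traces tensor shapes through the cascade; the blog post itself walks through what each stage actually computes (diffusion, attention, super-resolution) and how the two models differ at each step.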