Abstract: We present VideoPoet, a model for synthesizing high-quality videos from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that is adapted to a range of video generation tasks. We present results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting the generation of high-fidelity motions.