Abstract: We present VideoPoet, a model for synthesizing high-quality videos from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that is adapted to a range of video generation tasks. We present results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting the generation of high-fidelity motions.