Video Adapter

Probabilistic Adaptation of Black-Box Text-to-Video Models

Abstract

Large text-to-video models trained on internet-scale data have demonstrated exceptional capabilities in generating high-fidelity videos from arbitrary textual descriptions. However, similar to proprietary language models, large text-to-video models are often black boxes whose weight parameters are not publicly available, posing a significant challenge to adapting these models to specific domains such as robotics, animation, and personalized stylization. Inspired by how a large language model can be prompted to perform new tasks without access to the model weights, we investigate how to adapt a black-box pretrained text-to-video model to a variety of downstream domains without access to its weights. To answer this question, we propose Video Adapter, which leverages the score function of a large pretrained video diffusion model as a probabilistic prior to guide the generation of a task-specific small video model. Our experiments show that, by incorporating the broad knowledge and fidelity of the pretrained model probabilistically, a small model with as few as 1.25% of the parameters of the pretrained model can generate high-quality yet domain-specific videos for a variety of downstream domains such as animation, egocentric modeling, and modeling of simulated and real-world robotics data. As large text-to-video models start to become available as a service, similar to large language models, we advocate for private institutions to expose the scores of video diffusion models as outputs, in addition to generated videos, to allow flexible adaptation of large pretrained text-to-video models by the general public.

Video Adapter Framework

Adaptation through Score Composition

Video Adapter requires training only a small domain-specific text-to-video model with orders of magnitude fewer parameters than a large video model pretrained on internet data. During sampling, Video Adapter composes the scores of the pretrained and domain-specific video models, achieving high-quality and flexible video synthesis.
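The score composition at each denoising step can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name, the prior-strength weighting scheme, and the use of predicted noise as a stand-in for the score are all assumptions for clarity.

```python
import numpy as np

def compose_scores(eps_pretrained, eps_domain, prior_strength=0.5):
    """Combine the pretrained model's predicted noise (acting as a
    probabilistic prior) with the small domain model's predicted noise.

    A weighted sum of scores corresponds to sampling from a product
    of the two models' distributions; the exact weighting used here
    is an illustrative assumption.
    """
    return prior_strength * eps_pretrained + (1.0 - prior_strength) * eps_domain
```

At every sampling step, both models are queried on the same noisy video and the composed score drives the update. Note that only the small model is trained, and the pretrained model need only expose its scores, never its weights.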

Video Adapter for Animation and Robotics

We can train a small video model on the animation style of a particular artist (Detective Conan). The pretrained prior can maintain the artist's style while changing the background. We can also train task-specific small edge-to-sim and edge-to-real models on robotics videos; the pretrained prior can then be used to modify the styles of the videos as a form of domain randomization.

Ablation against Naive Classifier-Free Score Mix

Below we show an ablation of Video Adapter (with different prior strengths) against naively combining classifier-free scores. Video Adapter modifies the style as instructed (top), whereas directly mixing two classifier-free guidance scores fails to adapt the video (bottom).
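The contrast between the two approaches can be sketched as follows. The naive baseline averages two independently guided scores, whereas Video Adapter uses the pretrained score as a prior inside a single guided update. The specific formulas, weights, and function names below are illustrative assumptions, not the paper's exact equations.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance=7.5):
    # Standard classifier-free guidance for a single model.
    return eps_uncond + guidance * (eps_cond - eps_uncond)

def naive_cfg_mix(eps_u_pre, eps_c_pre, eps_u_dom, eps_c_dom, guidance=7.5):
    # Baseline from the ablation: average two independently guided
    # scores. In the experiments above this fails to adapt the video.
    return 0.5 * (cfg(eps_u_pre, eps_c_pre, guidance)
                  + cfg(eps_u_dom, eps_c_dom, guidance))

def video_adapter_score(eps_u_pre, eps_u_dom, eps_c_dom,
                        prior_strength=0.5, guidance=7.5):
    # Video Adapter (sketch): the pretrained score is blended in as a
    # prior before guidance is applied, so a single guided update uses
    # both models. The exact blending rule is an assumption here.
    base = prior_strength * eps_u_pre + (1.0 - prior_strength) * eps_u_dom
    return base + guidance * (eps_c_dom - base)
```

With `prior_strength = 0`, `video_adapter_score` reduces to ordinary classifier-free guidance on the small domain model; increasing the prior strength pulls generation toward the pretrained model's broad knowledge, which is the knob varied in the ablation above.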