Few-Shot Learning in Video Diffusion Models

Published: 26 May 2026, Last Modified: 03 Jun 2026
Venue: ICML 2026 FoGen Workshop Poster
License: CC BY 4.0
Keywords: few-shot learning, video diffusion models, diffusion models, parameter-efficient adaptation, LoRA, visual priors, image-to-image tasks, generative models, generalist vision models
TL;DR: Pretrained video diffusion models can act as few-shot visual learners: by encoding tasks as transition videos and tuning small LoRA adapters, they generalize across diverse vision tasks with as few as 1–30 examples.
Abstract: Video Diffusion Models (VDMs) are trained for video generation, yet this objective implicitly induces structured visual representations that extend beyond this task. In this work, we investigate whether such pretrained models can be adapted to perform classical computer vision tasks from only a few examples. We introduce a simple few-shot adaptation framework in which each task is specified by a small set of paired input and output images, encoded as short transition videos. Lightweight LoRA adapters are trained while keeping the VDM backbone frozen, and predictions are obtained from the final frame of generated sequences. Across a diverse set of image-to-image tasks, including geometric transformations, style transfer, dense prediction, and classification, we find that the model can generalize from as few as one to thirty examples. Performance varies across tasks, with simpler transformations requiring minimal supervision and more structured problems benefiting from additional examples. These results suggest that pretrained VDMs encode reusable visual priors that can be exposed and steered through limited supervision. Overall, our findings position video diffusion models as flexible visual learners with the potential to become vision generalists.
Submission Number: 43
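
The abstract describes encoding each input/output pair as a short transition video and reading the prediction from the final frame. A minimal sketch of that data layout follows. The function names, the frame count, and the linear cross-fade between frames are all assumptions made for illustration, since the abstract does not say how the intermediate frames are built. LoRA training and the VDM itself are not modeled here.

```python
from typing import List

# Grayscale image as rows of pixel intensities in [0, 1] (illustrative type).
Image = List[List[float]]


def make_transition_video(source: Image, target: Image,
                          num_frames: int = 8) -> List[Image]:
    """Build a clip whose first frame is `source` and last frame is `target`.

    ASSUMPTION: intermediate frames are a linear cross-fade. The abstract
    does not specify how the transition is constructed.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames (input and output)")
    same_rows = len(source) == len(target)
    if not same_rows or any(len(a) != len(b) for a, b in zip(source, target)):
        raise ValueError("source and target must have the same shape")

    frames: List[Image] = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append([
            [(1 - t) * s + t * g for s, g in zip(src_row, tgt_row)]
            for src_row, tgt_row in zip(source, target)
        ])
    return frames


def read_prediction(video: List[Image]) -> Image:
    """Take the final frame of a clip as the task output."""
    return video[-1]


# Toy few-shot example for a geometric task: horizontal flip.
source = [[0.0, 1.0], [0.25, 0.75]]
target = [row[::-1] for row in source]
clip = make_transition_video(source, target, num_frames=5)
prediction = read_prediction(clip)
```

In the setting the abstract describes, clips like `clip` would serve as the few training examples for the adapters, and at test time the model would generate the clip from the input frame, with `read_prediction` applied to the generated sequence.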