Abstract: In this paper, we present a few-shot text-to-video framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos on a single GPU. Unlike existing methods, which require large training resources or learn motions that are precisely aligned with template videos, LAMP strikes a trade-off between generation freedom and the resource cost of model training. Specifically, we design a motion-content decoupled pipeline that uses an off-the-shelf text-to-image model for content generation, so that our tuned video diffusion model mainly focuses on motion learning. Well-developed text-to-image techniques can provide visually pleasing and diverse content as generation conditions, which greatly improves video quality and generation freedom. To capture features along the temporal dimension, we expand the pre-trained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers and modify the attention blocks to the temporal level. Additionally, we develop an effective inference trick, shared-noise sampling, which improves the stability of generated videos at no extra computational cost. Our method can also be flexibly applied to other tasks, e.g., real-world image animation and video editing. Extensive experiments demonstrate that LAMP can effectively learn the motion pattern from limited data and generate high-quality videos. The code and models are available at https://rq-wu.github.io/projects/LAMP.
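As a rough illustration of the shared-noise sampling idea mentioned above, the PyTorch sketch below initialises each frame's latent noise around a single shared base noise, so that all frames start from correlated latents at inference time. The blending rule, the mixing weight `alpha`, and the function name are assumptions for illustration only, not the paper's exact formulation.

```python
import torch

def shared_noise_init(batch, channels, num_frames, height, width,
                      alpha=0.2, device="cpu", generator=None):
    """Sample per-frame initial latents around one shared base noise.

    `alpha` is a hypothetical mixing weight: larger values let each
    frame deviate more from the shared component.
    """
    # One noise map shared by every frame of the video (frame dim = 1).
    base = torch.randn(batch, channels, 1, height, width,
                       device=device, generator=generator)
    # Independent per-frame perturbations on top of the shared component.
    residual = torch.randn(batch, channels, num_frames, height, width,
                           device=device, generator=generator)
    # Blend and rescale so the result keeps approximately unit variance,
    # as expected by the diffusion sampler at the first timestep.
    latents = (1 - alpha) * base + alpha * residual
    latents = latents / (((1 - alpha) ** 2 + alpha ** 2) ** 0.5)
    return latents

# Example: latents for a 16-frame video at 64x64 latent resolution.
init_latents = shared_noise_init(batch=1, channels=4, num_frames=16,
                                 height=64, width=64, alpha=0.2)
```

Because every frame shares the same dominant noise component, the denoised frames tend to agree on global layout and color, which is the kind of temporal stability the trick targets, and it adds no cost beyond sampling one extra noise tensor.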