Prompt Learning Based Adaptor for Enhanced Video Editing with Pretrained Text-to-Image Diffusion Models
Keywords: LoRA, diffusion, video generation
Abstract: The rapid advancement of diffusion-based text-to-image generation has produced remarkable results, driving the emergence of video editing applications built on these pretrained models. Because text-to-image generation operates on each frame independently, existing video editing models pursue temporal consistency either by fine-tuning temporal layers or by propagating temporal features at test time without additional training. While these approaches show promise, the frame independence of text-to-image generation remains a bottleneck for producing consistent, high-quality video. In this paper, we propose a lightweight adaptor that uses prompt learning to enhance video editing performance at minimal training cost. Our approach introduces shared prompt tokens to improve editing capability and unshared frame-specific tokens to impose consistency constraints across frames. The adaptor integrates seamlessly into existing video editing pipelines, offering significant improvements in temporal coherence and overall video quality and benefiting a broad spectrum of downstream video editing algorithms.
Submission Number: 5
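To make the abstract's mechanism concrete, below is a minimal PyTorch sketch of a prompt-learning adaptor in the spirit described: shared learnable tokens appended to every frame's text embedding to steer editing, plus unshared per-frame tokens as a frame-specific consistency signal. The class and parameter names (`PromptLearningAdaptor`, `num_shared`, `num_frame_specific`) and the token-concatenation scheme are illustrative assumptions, not the submission's actual implementation.

```python
import torch
import torch.nn as nn


class PromptLearningAdaptor(nn.Module):
    """Hypothetical prompt-learning adaptor (a sketch, not the paper's code).

    Shared tokens are appended to every frame's text embedding; unshared
    per-frame tokens add a frame-specific signal intended to encourage
    cross-frame consistency.
    """

    def __init__(self, num_frames: int, embed_dim: int,
                 num_shared: int = 8, num_frame_specific: int = 2):
        super().__init__()
        # Shared prompt tokens: identical across all frames.
        self.shared_tokens = nn.Parameter(
            torch.randn(num_shared, embed_dim) * 0.02)
        # Unshared tokens: a small, separately learned set per frame.
        self.frame_tokens = nn.Parameter(
            torch.randn(num_frames, num_frame_specific, embed_dim) * 0.02)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (num_frames, seq_len, embed_dim), e.g. from a
        # frozen text encoder of a pretrained text-to-image model.
        f = text_embeds.shape[0]
        shared = self.shared_tokens.unsqueeze(0).expand(f, -1, -1)
        # Append learned tokens after the original prompt embeddings.
        return torch.cat([text_embeds, shared, self.frame_tokens[:f]], dim=1)
```

In a typical pipeline of this kind, the augmented embeddings would replace the text-conditioning input (e.g. the `encoder_hidden_states` of a frozen diffusion UNet), with only the adaptor's token parameters updated during training, which is what keeps the training cost small.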