Abstract: Latent Diffusion Models (LDMs) are renowned for their powerful capabilities in image and video synthesis.
Yet, compared to text-to-image (T2I) editing, text-to-video (T2V) editing often lacks temporal consistency and structural fidelity, owing to insufficient pre-training data, limited model editability, or extensive tuning costs. To address this gap, we propose FLDM (Fused Latent Diffusion Model), a training-free framework that achieves high-quality T2V editing by integrating T2I and T2V LDMs. Specifically, FLDM fuses image and video latents during the denoising process using a fusion hyper-parameter governed by an update schedule.
This paper is the first to reveal that T2I and T2V LDMs can complement each other in terms of structure and temporal consistency, ultimately generating high-quality videos.
Notably, FLDM can serve as a versatile plugin for off-the-shelf image and video LDMs, significantly enhancing the quality of video editing.
Extensive quantitative and qualitative experiments on popular T2I and T2V LDMs demonstrate that FLDM achieves superior editing quality over state-of-the-art T2V editing methods.
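To make the fusion idea concrete, the following is a minimal sketch of how image and video latents might be combined at each denoising step under a scheduled weight. The function names (fusion_weight, fused_denoising_step), the linear schedule, and the tensor layout are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch


def fusion_weight(t: int, num_steps: int,
                  alpha_start: float = 0.8, alpha_end: float = 0.0) -> float:
    """Hypothetical update schedule for the fusion hyper-parameter.

    Assumption: the T2I latent is weighted more heavily early in denoising
    (favoring structure) and less later (favoring temporal consistency).
    """
    return alpha_start + (alpha_end - alpha_start) * t / max(num_steps - 1, 1)


def fused_denoising_step(z_t: torch.Tensor, t: int, num_steps: int,
                         image_unet, video_unet, cond) -> torch.Tensor:
    """One denoising step that fuses per-frame T2I latents with the T2V latent.

    z_t is a video latent of shape (batch, channels, frames, H, W);
    image_unet and video_unet are stand-ins for off-the-shelf T2I/T2V LDM denoisers.
    """
    b, c, f, h, w = z_t.shape
    alpha = fusion_weight(t, num_steps)

    # T2I model denoises each frame independently (frames folded into the batch).
    z_frames = z_t.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
    z_image = image_unet(z_frames, t, cond)
    z_image = z_image.reshape(b, f, c, h, w).permute(0, 2, 1, 3, 4)

    # T2V model denoises the whole clip jointly.
    z_video = video_unet(z_t, t, cond)

    # Fuse the two latents with the scheduled weight.
    return alpha * z_image + (1.0 - alpha) * z_video
```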
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: We propose FLDM (Fused Latent Diffusion Model), a simple yet effective strategy that enhances the temporal consistency and structure of edited videos without any tuning cost. Our method can serve as a versatile plugin for various off-the-shelf T2I and T2V models, which we believe will be valuable for real-world practice.
Supplementary Material: zip
Submission Number: 1490