Shaping a Stabilized Video by Mitigating Unintended Changes for Concept-Augmented Video Editing

Mingce Guo, Jingxuan He, Yufei Yin, Zhangye Wang, Shengeng Tang, Lechao Cheng

Published: 2025, Last Modified: 21 Jan 2026IJCAI 2025EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Text-driven video editing powered by generative diffusion models holds significant promise for applications spanning film production, advertising, and beyond. However, the limited expressiveness of pre-trained word embeddings often restricts nuanced edits, especially when targeting novel concepts with specific attributes. In this work, we present a novel Concept-Augmented Textual Inversion (CATI) framework that flexibly integrates new object information from user-provided concept videos. By fine-tuning only the V (Value) projection in attention via Low-Rank Adaptation (LoRA), our approach preserves the original attention distribution of the diffusion model while efficiently incorporating external concept knowledge. To further stabilize editing results and mitigate the issue of attention dispersion when prompt keywords are modified, we introduce a Dual Prior Supervision (DPS) mechanism. DPS supervises cross-attention between the source and target prompts, preventing undesired changes to non-target areas and improving the fidelity of novel concepts. Extensive evaluations demonstrate that our plug-and-play solution not only maintains spatial and temporal consistency but also outperforms state-of-the-art methods in generating lifelike and stable edited videos. The source code is publicly available at https://guomc9.github.io/STIVE-PAGE/.

External IDs:dblp:conf/ijcai/Guo0YWTC25