MOTION: Multi-object Video Editing with Training-Free Attention Guidance

Qitong Yan, Jian Jia, Shengyuan Liu, Chang Liu, Bo Wang, Quan Chen, Peng Jiang, Minfeng Zhu, Linchao Zhu, Wei Chen

Published: 2025, Last Modified: 04 Mar 2026, ICIC (18) 2025, CC BY-SA 4.0

Abstract: Recent video editing methods have achieved remarkable success. Although current methods perform well when modifying a single object, they struggle when multiple objects in a video must each receive a different modification. Existing methods usually face two major challenges: attribute leakage and attribute missing. Attribute leakage occurs when text attributes are mistakenly injected into regions that do not belong to the target object. Attribute missing occurs when the object region is not effectively transformed to match the textual content. In this paper, we present MOTION, a Multi-Object video editing framework with Training-free attentION guidance. Specifically, to address attribute leakage, we propose a position-guided cross-attention mechanism that keeps textual attribute modifications focused on the corresponding objects. To address attribute missing, we propose an instance-aware attention enhancement mechanism that automatically enhances the attention values of the regions to be edited. In addition, we construct a dataset for comprehensive evaluation of multi-object video editing, in which each video contains multiple objects that need to be edited separately. Because CLIP has difficulty evaluating the editing effect on each individual object in a video, we introduce a precise metric, αCLIP Score, which improves the quantitative evaluation of multi-object video editing. Extensive quantitative experiments and visualizations demonstrate that our method offers significant advantages over existing methods on the multi-object video editing task.
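
The abstract does not give the exact formulation of the two attention mechanisms, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes per-object binary masks over pixel positions and a mapping from each text token to the object it describes. The names `guide_cross_attention`, `token_owner`, `object_masks`, and `boost` are hypothetical. Attribute tokens are blocked outside their object's mask (to limit leakage) and their logits are raised inside it (to counter missing attributes).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def guide_cross_attention(logits, token_owner, object_masks, boost=1.0):
    """Illustrative mask-guided cross-attention (assumed form, not from the paper).

    logits:       logits[p][t] is the pre-softmax score of pixel p for token t.
    token_owner:  token_owner[t] is an object index, or None for shared tokens.
    object_masks: object_masks[k][p] is 1 if pixel p belongs to object k, else 0.
    boost:        hypothetical constant added to an owned token inside its mask.
    """
    # A shared token keeps every row finite after masking.
    if all(owner is not None for owner in token_owner):
        raise ValueError("at least one token must be shared (owner None)")

    guided = []
    for p, row in enumerate(logits):
        new_row = []
        for t, value in enumerate(row):
            owner = token_owner[t]
            if owner is None:
                new_row.append(value)          # shared token: unchanged
            elif object_masks[owner][p]:
                new_row.append(value + boost)  # inside the object: enhance
            else:
                new_row.append(-math.inf)      # outside the object: block
        guided.append(softmax(new_row))
    return guided


# Toy example: 2 pixels, 3 tokens (shared, attribute of object 0, attribute of object 1).
attn = guide_cross_attention(
    logits=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    token_owner=[None, 0, 1],
    object_masks=[[1, 0], [0, 1]],
    boost=1.0,
)
```

In this toy setup, pixel 0 gives zero weight to the attribute token of object 1 and pixel 1 gives zero weight to the attribute token of object 0, while each pixel weights its own object's attribute token above the shared token. A real diffusion pipeline would apply such guidance to tensors inside the model's cross-attention layers, and the enhancement rule may well differ from the constant offset assumed here.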

External IDs: dblp:conf/icic/YanJLLWCJZZC25