Keywords: Video Editing, 3D Scene Editing, Video Diffusion Models, Generative Models
Abstract: We introduce HDEdit, a training-free framework for instruction-guided video and 3D scene editing that resolves the fundamental tension between instruction fulfillment and preservation of the original content through **H**ierarchical task **D**ecomposition. Our key insight is to progressively decompose complex edits into simpler subtasks. This hierarchical strategy serves dual objectives: an LLM-guided planner structures high-level subgoals for reliable instruction fulfillment, while embedding-space interpolation further refines each subgoal to preserve unedited content. Two tailored control mechanisms -- word-level attention-map propagation and parallel denoising synchronization -- ensure temporally consistent execution without hyperparameter tuning. Beyond video, we extend HDEdit to 3D editing via a simple yet effective render-edit-reconstruct process that maintains strong geometric consistency. Extensive experiments demonstrate state-of-the-art results across diverse and challenging edits, including long-duration videos, fast camera motion, and significant 3D geometric changes.
Supplementary Material: zip
Submission Number: 321