Keywords: Video editing, diffusion models, semantic mixing
TL;DR: MoCA-Video is a training-free framework that enables controllable, temporally consistent semantic video editing by manipulating diffusion noise trajectories.
Abstract: We present MoCA-Video, a training-free framework for semantic mixing in videos.
Operating in the latent space of a frozen video diffusion model, MoCA-Video combines class-agnostic segmentation with a diagonal denoising scheduler to localize and track the target object across frames.
To ensure temporal stability under semantic shifts, we introduce a momentum-based correction that approximates novel hybrid distributions beyond the training data distribution, alongside a lightweight gamma residual module that smooths out visual artifacts.
We evaluate the model's performance using SSIM, LPIPS, and a newly proposed metric that quantifies semantic alignment between the reference and the output.
Extensive evaluation demonstrates that our model consistently outperforms both training-free and trained baselines, achieving superior semantic mixing and temporal coherence without retraining. These results establish that structured manipulation of diffusion noise trajectories enables controllable, high-quality video editing under semantic shifts.
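To make the momentum-based correction and the gamma residual smoothing concrete, the following is a minimal sketch of how such a latent-space update could look inside a frozen diffusion sampling loop. All function names, tensor shapes, and hyperparameters (`beta`, `gamma`, `step`, the mask source) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): momentum-based correction of a
# denoised video latent toward an injected concept latent, followed by a
# small residual blend that smooths out artifacts.
import torch

def momentum_correct(z_denoised, z_concept, mask, velocity, beta=0.9, step=0.1):
    """Nudge masked latent regions toward the concept latent with momentum."""
    # Direction from the current denoised latent toward the injected concept,
    # restricted to the tracked object region.
    delta = (z_concept - z_denoised) * mask
    # Momentum accumulation stabilizes the correction across denoising steps.
    velocity = beta * velocity + (1.0 - beta) * delta
    return z_denoised + step * velocity, velocity

def gamma_residual_smooth(z_corrected, z_uncorrected, gamma=0.2):
    """Blend back a small residual of the uncorrected latent to reduce artifacts."""
    return (1.0 - gamma) * z_corrected + gamma * z_uncorrected

if __name__ == "__main__":
    # Toy shapes: one frame latent with 4 channels on an 8x8 grid.
    z = torch.randn(1, 4, 8, 8)                      # current denoised latent
    z_concept = torch.randn(1, 4, 8, 8)               # latent of the injected concept
    mask = (torch.rand(1, 1, 8, 8) > 0.5).float()     # tracked-object mask
    velocity = torch.zeros_like(z)

    z_corr, velocity = momentum_correct(z, z_concept, mask, velocity)
    z_smooth = gamma_residual_smooth(z_corr, z)
```

In an actual sampler this update would be applied per frame at each denoising step, with the mask and velocity carried across frames to keep the injected concept temporally consistent.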
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 11677