Abstract: Video editing with diffusion models presents significant challenges, especially under real-time constraints. Current methods either enhance temporal consistency at the cost of slow processing or rely on frame-by-frame editing, leading to flickering and temporal artifacts. To address both challenges, we propose SmoothDiffusion-VE, a streaming-based editing approach that improves temporal consistency and processing speed through our proposed Adaptive Feature Cache (AFC) and motion-guided attention. The AFC dynamically adjusts caching behavior based on perceptual similarity (LPIPS) between frames, i.e., shifting to a mini-cache mode for similar frames to reduce computational load. Conversely, significant frame changes trigger deeper caching to maintain robust temporal coherence. Our motion-guided attention selectively focuses on dynamic regions using optical flow, reducing unnecessary computation in static areas and accelerating processing. SmoothDiffusion-VE runs at 28 FPS on a single RTX 4090 GPU, achieving a 1564× speedup over Plug-and-Play Diffusion (PNP) and a 1916× speedup over Diffusion Motion Transfer (DMT), delivering a powerful solution for fast and consistent video editing.
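The AFC decision rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names and the threshold are assumptions, and a mean-squared-error proxy stands in for LPIPS so the example stays self-contained.

```python
# Hedged sketch of the Adaptive Feature Cache (AFC) mode-switching logic.
# Assumptions: class/method names, the 0.01 threshold, and the MSE proxy
# for LPIPS are all illustrative, not from the paper.
import numpy as np


def frame_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Stand-in for perceptual (LPIPS) distance: plain mean squared error.
    return float(np.mean((a - b) ** 2))


class AdaptiveFeatureCache:
    def __init__(self, threshold: float = 0.01):
        self.threshold = threshold  # distance below which frames count as similar
        self.cached_features = None
        self.prev_frame = None

    def step(self, frame: np.ndarray, compute_features):
        """Return (features, mode) for one incoming frame.

        mode is 'mini' when cached features are reused for a similar frame,
        'deep' when a significant change triggers recomputation and recaching.
        """
        if (self.prev_frame is not None
                and self.cached_features is not None
                and frame_distance(frame, self.prev_frame) < self.threshold):
            mode = "mini"                       # similar frame: reuse cache
            features = self.cached_features
        else:
            mode = "deep"                       # large change: recompute, recache
            features = compute_features(frame)
            self.cached_features = features
        self.prev_frame = frame
        return features, mode


afc = AdaptiveFeatureCache(threshold=0.01)
extract = lambda x: x * 2.0                    # dummy feature extractor
_, m0 = afc.step(np.zeros((8, 8)), extract)    # first frame -> 'deep'
_, m1 = afc.step(np.zeros((8, 8)), extract)    # identical frame -> 'mini'
_, m2 = afc.step(np.ones((8, 8)), extract)     # large change -> 'deep'
```

Under this toy policy, static stretches of video hit the mini-cache path and skip feature extraction entirely, which is the source of the abstract's claimed speedup on low-motion content.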