SmoothDiffusion-VE: Real-time Generative Video Editing Using Adaptive Feature Cache

Published: 01 Mar 2026, Last Modified: 24 Mar 2026 · IEEE/CVF Winter Conference on Applications of Computer Vision · CC BY-NC-ND 4.0
Abstract: Video editing with diffusion models presents significant challenges, especially under real-time constraints. Current methods either enhance temporal consistency at the cost of slow processing or rely on frame-by-frame editing, leading to flickering and temporal artifacts. To address both challenges, we propose SmoothDiffusion-VE, a streaming-based editing approach that improves temporal consistency and processing speed through our proposed Adaptive Feature Cache (AFC) and motion-guided attention. The AFC dynamically adjusts its caching behavior based on perceptual similarity (LPIPS) between frames, i.e., shifting to a mini-cache mode for similar frames to reduce computational load. Conversely, significant frame changes trigger deeper caching to maintain robust temporal coherence. Our motion-guided attention selectively focuses on dynamic regions using optical flow, reducing unnecessary computation in static areas and accelerating processing. SmoothDiffusion-VE runs at 28 FPS on a single RTX 4090 GPU, achieving a 1564× speedup over Plug-and-Play Diffusion (PNP) and a 1916× speedup over Diffusion Motion Transfer (DMT), delivering a powerful solution for fast and consistent video editing.
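The abstract's AFC decision rule (mini-cache for similar frames, deeper caching on significant change) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names and threshold are hypothetical, and a simple pixel-space distance stands in for LPIPS, the learned perceptual metric the paper actually uses.

```python
import numpy as np

def perceptual_distance(frame_a, frame_b):
    """Stand-in for LPIPS: mean absolute pixel difference (frames in [0, 1])."""
    return float(np.mean(np.abs(frame_a - frame_b)))

def select_cache_mode(prev_frame, cur_frame, threshold=0.1):
    """Hypothetical AFC switch: 'mini' reuses cached features for
    near-identical frames; 'deep' recomputes when the frame changes
    enough that stale features would hurt temporal coherence."""
    d = perceptual_distance(prev_frame, cur_frame)
    return "mini" if d < threshold else "deep"

# A static frame pair selects the lightweight mini-cache;
# a large inter-frame change triggers deep caching.
static = np.zeros((64, 64, 3))
moved = np.full((64, 64, 3), 0.5)
print(select_cache_mode(static, static))  # -> mini
print(select_cache_mode(static, moved))   # -> deep
```

In a real system the threshold would be tuned (or adapted online) so that camera noise stays in mini-cache mode while genuine scene motion triggers the deeper path.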