VIDES: VIDEO EDITING IN SECONDS WITH ONE-STEP DIFFUSION MODELS

19 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video Editing, Diffusion Models, Generative Models, Text Editing
Abstract: Text-guided video editing with diffusion models is prohibitively slow, hindered by costly multi-step sampling and inversion. We present VIDES, the first framework to successfully adapt one-step text-to-image (T2I) models for high-quality video editing, addressing the core challenges of inversion, editability, and temporal consistency. To bypass slow iterative inversion, we train a learnable encoder that predicts the initial noise for each frame in a single forward pass. This encoder is trained with a novel Structure-Aware Editing (SAE) loss on a curated dataset of structurally aligned image pairs, teaching it to preserve the source video's geometry during edits. For temporal coherence, we introduce Unified-Frame Editing (UFE), a technique that concatenates frame latents to enable cross-frame attention within a single generation step; for long videos, a sliding-window strategy with an anchor frame maintains global consistency. Our extensive experiments demonstrate that VIDES achieves editing quality comparable to or better than state-of-the-art multi-step methods while operating approximately 155 times faster. This breakthrough paves the way for practical, real-time video editing applications.
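To make the Unified-Frame Editing idea from the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: per-frame latents are flattened and concatenated into one token sequence so that a single self-attention pass attends across all frames before the result is split back into per-frame latents. The function name, tensor shapes, and the use of `nn.MultiheadAttention` are illustrative assumptions; the actual model presumably reuses the attention layers of the one-step T2I backbone.

```python
import torch
import torch.nn as nn

def unified_frame_attention(frame_latents: torch.Tensor,
                            attn: nn.MultiheadAttention) -> torch.Tensor:
    """Sketch of cross-frame attention via latent concatenation (assumed API).

    frame_latents: (F, C, H, W) latents for F frames from the one-step model.
    attn:          a self-attention layer operating on (B, N, C) tokens.
    """
    F, C, H, W = frame_latents.shape
    # (F, C, H, W) -> (1, F*H*W, C): one long token sequence spanning all frames.
    tokens = frame_latents.permute(0, 2, 3, 1).reshape(1, F * H * W, C)
    # Every token can attend to tokens from every frame in a single pass.
    tokens, _ = attn(tokens, tokens, tokens)
    # Restore the per-frame latent layout.
    return tokens.reshape(F, H, W, C).permute(0, 3, 1, 2)

# Usage sketch (hypothetical sizes): 8 frames of 16x16 latents with 64 channels,
# where the channel width doubles as the attention embedding dimension.
frames = torch.randn(8, 64, 16, 16)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
edited = unified_frame_attention(frames, attn)
print(edited.shape)  # torch.Size([8, 64, 16, 16])
```

For longer clips, the abstract's sliding-window strategy would apply this same joint attention to overlapping windows of frames while always including a designated anchor frame in each window, so that every window is conditioned on a common reference and global consistency is preserved.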
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18735