Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion

Published: 22 Jan 2026, Last Modified: 06 Mar 2026 · CPAL 2026 (Proceedings Track) Poster · CC BY 4.0
Keywords: Text-to-Image (T2I) Generation, Diffusion Models, Text-to-Video (T2V) Editing, Temporal Consistency, Spatial Consistency
TL;DR: A lightweight GE-Adapter enhances T2V editing by integrating temporal, spatial, and semantic consistency with DDIM inversion, achieving significantly improved coherence and perceptual quality without costly model retraining.
Abstract: Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame independence of T2I generation often results in poor temporal consistency. Existing methods address this issue through temporal-layer fine-tuning or inference-based temporal propagation, but these approaches suffer from high training costs or limited temporal coherence. To address these challenges, we propose a General and Efficient Adapter (GE-Adapter) that integrates temporal, spatial, and semantic consistency with Bilateral Denoising Diffusion Implicit Model (DDIM) inversion. This framework introduces three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks) to capture frame-specific features and enforce smooth inter-frame transitions using temporally aware loss functions; (2) Channel-Dependent Spatial Consistency Blocks (SCD Blocks) employing bilateral filters to enhance spatial coherence by reducing noise and artifacts; (3) a Token-based Semantic Consistency Module (TSC Module) to maintain semantic alignment through a combination of shared prompt tokens and frame-specific tokens. Extensive experiments on multiple datasets demonstrate that our method significantly improves perceptual quality, text-image relevance, and temporal coherence. The proposed approach offers a practical and efficient solution for text-to-video (T2V) editing. Our code is available in the supplementary materials.
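The abstract describes a temporally aware loss that penalizes abrupt changes between adjacent frames. A minimal sketch of one plausible form of such a loss is below; the function name, the mean-squared-difference formulation, and the flat per-frame feature lists are illustrative assumptions, not the paper's actual objective.

```python
def temporal_consistency_loss(frame_feats):
    """Mean squared difference between adjacent frames' feature vectors.

    frame_feats: list of per-frame feature vectors (lists of floats),
    ordered in time. A hypothetical stand-in for the temporally aware
    loss used by the FTC Blocks; the paper's exact formulation may differ.
    """
    sq_diffs = [
        (cur_v - prev_v) ** 2
        for prev, cur in zip(frame_feats, frame_feats[1:])
        for prev_v, cur_v in zip(prev, cur)
    ]
    return sum(sq_diffs) / len(sq_diffs)

# Example: features drifting linearly over 5 frames incur a small,
# constant per-step penalty; identical frames would give zero loss.
feats = [[t * 0.1, t * 0.1] for t in range(5)]
loss = temporal_consistency_loss(feats)
```

In practice such a term would be combined with the spatial (bilateral-filter) and semantic (shared-token) objectives during editing, weighting smoothness against fidelity to the edit prompt.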
Submission Number: 46