Patch’n Play: Zero-Shot Video Editing by Fusing Local and Global Patches

Published: 24 Sept 2025, Last Modified: 07 Nov 2025 · NeurIPS 2025 Workshop GenProCC · CC BY 4.0
Track: Short paper
Keywords: video editing
Abstract: Recent diffusion-based models have achieved remarkable results in generating images from text prompts. Despite these advancements, video editing methods have lagged behind in visual quality and editing capability. This paper introduces Patch'n Play, a novel zero-shot video editing method that leverages both local and global latent features to enhance temporal consistency. Unlike previous approaches that prioritize global consistency at the expense of local consistency, our method aggregates and fuses local features from each frame with global information shared across multiple frames. Compatible with pre-trained text-to-image diffusion models, our approach requires neither prompt-specific training nor user-generated masks. Our qualitative and quantitative analysis shows that Patch'n Play outperforms existing methods across a wide array of video editing contexts.
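The abstract does not specify how the local and global features are combined. Purely as an illustration of the general idea, a toy NumPy sketch of blending per-frame (local) latent features with a cross-frame (global) aggregate might look like the following; the function name `fuse_local_global`, the blend weight `alpha`, and the mean aggregation are all assumptions for this sketch, not the paper's actual algorithm.

```python
import numpy as np

def fuse_local_global(frame_feats: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Toy fusion of per-frame (local) latent features with a
    cross-frame (global) aggregate.

    frame_feats: array of shape (T, H, W, C) -- one latent map per frame.
    alpha: blend weight between local detail and global consistency.
    NOTE: this is a hypothetical sketch, not the method from the paper.
    """
    # Global information shared across frames: mean over the time axis.
    global_feat = frame_feats.mean(axis=0, keepdims=True)  # (1, H, W, C)
    # Each frame keeps its local features but is pulled toward the
    # shared global aggregate, encouraging temporal consistency.
    return alpha * frame_feats + (1.0 - alpha) * global_feat

# Usage: 8 frames of 4x4 latent maps with 3 channels.
feats = np.random.rand(8, 4, 4, 3).astype(np.float32)
fused = fuse_local_global(feats, alpha=0.7)
assert fused.shape == feats.shape
```

With `alpha = 1.0` each frame is untouched (purely local); with `alpha = 0.0` every frame collapses to the shared global aggregate. The actual method presumably performs a more sophisticated, patch-level fusion inside the diffusion process.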
Submission Number: 50