Adaptive multi-frame sampling for consistent zero-shot text-to-video editing

TMLR Paper6877 Authors

07 Jan 2026 (modified: 16 Jan 2026) · Under review for TMLR · License: CC BY 4.0
Abstract: Achieving convincing temporal coherence is a fundamental challenge in zero-shot text-to-video editing. To address this issue, this paper introduces AMAC (Adaptive Multi-frame sAmpling for Consistent zero-shot text-to-video editing), a novel method that effectively balances temporal consistency with detail preservation. Our approach combines a theoretical framework with a fully adaptive sampling strategy that selects frames for joint processing with a pre-trained text-to-image diffusion model. By reformulating the sampling strategy as a stochastic permutation over frame indices and constructing its distribution from inter-frame similarities, we promote consistent processing of related content. The method is robust to temporal variations and shot transitions, making it particularly well suited to editing long, dynamic video sequences, as validated by experiments on the DAVIS and BDD100K datasets. Example edited videos are available in the anonymous repository https://anonymous.4open.science/r/AMAC-A406.
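The abstract's core idea of a stochastic permutation over frame indices, with a distribution shaped by inter-frame similarities, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes frames are represented by feature vectors, and it biases each next pick toward frames similar to the previously selected one via a similarity-weighted random walk (the helper name `similarity_biased_permutation` is hypothetical).

```python
import numpy as np

def similarity_biased_permutation(features, rng=None):
    """Sample a permutation of frame indices in which each next frame is
    drawn with probability proportional to its cosine similarity to the
    previously selected frame (illustrative sketch only)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(features)
    # Cosine similarity matrix between frame feature vectors.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    remaining = list(range(n))
    # Start the walk from a uniformly random frame.
    order = [remaining.pop(rng.integers(n))]
    while remaining:
        # Weight remaining frames by similarity to the last chosen frame,
        # clamped to stay strictly positive.
        w = np.array([max(sim[order[-1], j], 1e-8) for j in remaining])
        idx = rng.choice(len(remaining), p=w / w.sum())
        order.append(remaining.pop(idx))
    return order
```

Under this sketch, frames with similar content tend to appear adjacently in the sampled order, so they are more likely to be grouped for joint diffusion processing.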
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhengzhong_Tu1
Submission Number: 6877