Abstract: Image and video synthesis have been extensively studied in academia, and computer-generated videos are becoming increasingly popular among the general public. However, ensuring the temporal consistency of generated videos remains a challenging problem. Most existing algorithms for temporal consistency enhancement rely on motion cues from a guidance video to filter the temporally inconsistent video. This paper proposes a novel approach that achieves temporal consistency from a single input video. The key observation is that a coarse guidance video can be obtained through temporal smoothing, and its visual quality can then be refined with a rolling guidance pipeline. We use only an off-the-shelf optical-flow estimation model as external visual knowledge. The proposed algorithm has been evaluated on a wide range of videos synthesized by various methods, including single-image processing models and text-to-video models. Our method effectively eliminates temporal inconsistency while preserving the input visual content.
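To make the "coarse guidance via temporal smoothing" idea concrete, the sketch below shows a minimal per-pixel sliding-window average over the frames of a video. This is only an illustrative assumption, not the paper's actual pipeline: the described method additionally uses optical-flow guidance and a rolling guidance refinement stage, and the window size used here is a hypothetical choice.

```python
import numpy as np

def temporal_smooth(frames: np.ndarray, window: int = 5) -> np.ndarray:
    """Sliding-window temporal average over a video clip.

    frames: array of shape (T, H, W, C), float values in [0, 1].
    Returns a coarse, temporally smoothed guidance video of the same shape.
    The window size is a hypothetical parameter, not taken from the paper.
    """
    num_frames = frames.shape[0]
    half = window // 2
    smoothed = np.empty_like(frames)
    for t in range(num_frames):
        # Average over a temporal neighborhood, clipped at the clip boundaries.
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        smoothed[t] = frames[lo:hi].mean(axis=0)
    return smoothed

# Usage example on a synthetic 30-frame, 64x64 RGB clip.
video = np.random.rand(30, 64, 64, 3).astype(np.float32)
guidance = temporal_smooth(video, window=5)
```

Such an average suppresses frame-to-frame flicker but also blurs detail, which is why, per the abstract, the coarse guidance video still needs a refinement stage to recover visual quality.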
DOI: 10.1007/978-981-97-2092-7_6