TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

Published: 19 Oct 2025, Last Modified: 04 Mar 2026 | ICCV 2025 | Everyone | CC BY 4.0
Abstract: Video Frame Interpolation (VFI) aims to predict the intermediate frame I_n (we use n to denote time in videos to avoid notation overload with the timestep t in diffusion models) based on two consecutive neighboring frames I_0 and I_1. Recent approaches apply diffusion models (both image-based and video-based) to this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves a 20% improvement in FID on the most challenging datasets over recent SOTA image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having 3× fewer parameters. Such a parameter reduction results in a 2.3× speedup. By incorporating optical flow guidance, our method requires 9000× less training data and achieves over 20× fewer parameters than video-based diffusion models.