Abstract: Video Frame Interpolation (VFI) aims to predict the
intermediate frame I_n (we use n to denote time in videos
to avoid notation overload with the timestep t in diffusion
models) based on two consecutive neighboring frames I_0
and I_1. Recent approaches apply diffusion models (both
image-based and video-based) to this task and achieve
strong performance. However, image-based diffusion models
are unable to extract temporal information and are relatively
inefficient compared to non-diffusion methods. Video-based
diffusion models can extract temporal information,
but they are too large in terms of training scale, model
size, and inference time. To mitigate the above issues,
we propose Temporal-Aware Latent Brownian Bridge Diffusion
for Video Frame Interpolation (TLB-VFI), an efficient
video-based diffusion model. By extracting rich temporal
information from video inputs through our proposed
3D-wavelet gating and temporal-aware autoencoder, our
method achieves a 20% improvement in FID on the most
challenging datasets over the recent SOTA among image-based
diffusion models. Meanwhile, because it exploits this rich
temporal information, our method achieves strong performance
with 3× fewer parameters. This parameter reduction
results in a 2.3× speedup. By incorporating optical
flow guidance, our method requires 9000× less training
data and has over 20× fewer parameters than video-based
diffusion models.