Spatially-Guided Temporal Attention (SGuTA) and Shifted-Cube Attention (SCubA) for Video Frame Interpolation

21 Apr 2023 (modified: 12 Dec 2023) · Submitted to NeurIPS 2023
Keywords: Video frame interpolation, Transformer
TL;DR: Extensive experiments demonstrate that our models handle large motions well and provide precise motion estimation, setting new state-of-the-art results on various benchmarks.
Abstract: In recent years, methods based on convolutional kernels have achieved state-of-the-art performance on the video frame interpolation (VFI) task. However, due to the inherent limits of their kernel sizes, their performance appears to have plateaued. Transformers, meanwhile, are gradually replacing convolutional neural networks as the backbone of choice in image tasks, thanks to their ability to model global correlations; in video tasks, however, their computational complexity and memory requirements become far more challenging. To address this issue, we introduce two Transformer variants for VFI: SGuTA and SCubA. SGuTA uses the spatial information of each video frame to guide the generation of a temporal vector at each pixel position, while SCubA brings local attention to the VFI task and can be viewed as the counterpart of 3D convolution among local-attention Transformers. We further analyze and compare different embedding strategies and propose one that is better balanced in parameter count, computational complexity, and memory footprint. Extensive quantitative and qualitative experiments demonstrate that our models handle large motions well and provide precise motion estimation, setting new state-of-the-art results on various benchmarks. The source code is available at https://github.com/esthen-bit/SGuTA-SCubA.
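For intuition, here is a minimal sketch of what a shifted-cube attention block could look like in PyTorch, assuming Swin-style non-overlapping windows extended to 3D (time x height x width). The module name, cube size, and tensor layout below are illustrative assumptions, not the authors' implementation; the paper's actual code is in the linked repository.

```python
import torch
import torch.nn as nn

class ShiftedCubeAttention(nn.Module):
    """Hypothetical sketch: multi-head self-attention inside local 3D cubes,
    with an optional grid shift so neighboring cubes exchange information.
    Assumes T, H, W are divisible by the cube dimensions."""
    def __init__(self, dim, cube=(2, 4, 4), heads=4, shift=False):
        super().__init__()
        self.cube, self.heads, self.shift = cube, heads, shift
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        t, h, w = self.cube
        if self.shift:  # roll the grid so cube borders move, like shifted windows
            x = torch.roll(x, shifts=(-(t // 2), -(h // 2), -(w // 2)), dims=(1, 2, 3))
        # partition into non-overlapping cubes: (B * nT * nH * nW, t*h*w, C)
        x = x.view(B, T // t, t, H // h, h, W // w, w, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, t * h * w, C)
        # standard multi-head self-attention within each cube
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(z):  # (Bc, N, C) -> (Bc, heads, N, C // heads)
            return z.view(z.shape[0], z.shape[1], self.heads, -1).transpose(1, 2)
        q, k, v = map(split, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, t * h * w, C)
        out = self.proj(out)
        # reverse the cube partition and the shift
        out = out.view(B, T // t, H // h, W // w, t, h, w, C)
        out = out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        if self.shift:
            out = torch.roll(out, shifts=(t // 2, h // 2, w // 2), dims=(1, 2, 3))
        return out

# Usage: alternate shift=False / shift=True blocks so information propagates
# across cube boundaries, mirroring shifted windows in Swin Transformers.
x = torch.randn(1, 4, 16, 16, 64)  # (B, T, H, W, C)
block = ShiftedCubeAttention(dim=64, cube=(2, 4, 4), heads=4, shift=True)
y = block(x)  # same shape as x
```

Restricting attention to local cubes keeps the cost linear in the number of cubes rather than quadratic in the full spatio-temporal token count, which is why such a block can serve as a local-attention counterpart to 3D convolution.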
Supplementary Material: zip
Submission Number: 302