Abstract: Video super-resolution aims to reconstruct high-resolution counterparts of existing low-resolution videos. In recent research, transformers have become increasingly popular because of their capacity for parallel computation and their efficiency in extracting spatio-temporal features from video sequences. Combining self-attention with multi-scale methods has also yielded excellent results. However, this combination has a limitation: current up-sampling methods struggle to match the global modeling capacity of self-attention mechanisms. This paper therefore proposes three strategies for combining the two methods. First, based on an approximation strategy, we construct a new bilinear up-sampling method for multi-scale feature acquisition. Convolution and cross-attention are then used to correct and align features across scales, preventing large deviations in feature extraction at any single scale from corrupting subsequent feature extraction. Finally, to address the computational complexity, $C^{0}$ continuity, and neuron-death problems common to existing activation functions, a new method for constructing activation functions is proposed: a cubic spline is used to build an activation function that approximates tanh. The new activation function is piecewise-defined by cubic polynomial curves and is $C^{2}$ continuous. The proposed method achieves improved results on three public video super-resolution test sets: REDS4, Vid4, and Vimeo-90K-T. Experiments demonstrate that it offers a new solution for video super-resolution tasks.
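As a concrete illustration of the activation-function idea summarized above, the sketch below fits a natural cubic spline to tanh, which yields a $C^{2}$-continuous, piecewise-cubic function as the abstract describes. The knot grid on $[-3, 3]$ and the boundary handling are assumptions made for this example, not the paper's exact construction.

```python
# Illustrative sketch only: the paper constructs a C^2, piecewise-cubic
# activation approximating tanh.  Here we realize that idea with a natural
# cubic spline from SciPy; the knot grid on [-3, 3] and the boundary
# handling are assumptions made for this example.
import numpy as np
from scipy.interpolate import CubicSpline

knots = np.linspace(-3.0, 3.0, 13)                     # assumed knot grid
spline = CubicSpline(knots, np.tanh(knots), bc_type="natural")

def spline_tanh(x):
    """Piecewise-cubic, C^2-continuous approximation of tanh.

    Inside [-3, 3] this is a natural cubic spline through samples of tanh;
    outside, SciPy extends the boundary cubic pieces, which preserves C^2
    continuity but grows without bound -- a deployed activation would need
    saturating tails instead.
    """
    return spline(np.asarray(x, dtype=float))

# Quick sanity check: maximum deviation from tanh on the fitted interval.
xs = np.linspace(-3.0, 3.0, 1001)
print(float(np.max(np.abs(spline_tanh(xs) - np.tanh(xs)))))
```

Unlike ReLU-family activations, which are only $C^{0}$ at the origin and can suffer from dead neurons, a spline of this form has continuous first and second derivatives everywhere while remaining cheap to evaluate piecewise.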