End-to-End Transformer for Compressed Video Quality Enhancement

Published: 01 Jan 2024 · Last Modified: 06 Mar 2025 · IEEE Trans. Broadcast. 2024 · CC BY-SA 4.0
Abstract: Convolutional neural networks (CNNs) have achieved excellent results on the compressed video quality enhancement task in recent years. State-of-the-art methods explore the spatio-temporal information of adjacent frames mainly through deformable convolution. However, CNN-based methods can exploit only local information and therefore fail to capture global dependencies. Moreover, current methods enhance video quality at a single scale, ignoring multi-scale information, which corresponds to different receptive fields and is crucial for correlation modeling. In this work, we therefore propose a Transformer-based compressed video quality enhancement (TVQE) method, consisting of a Transformer-based Spatio-Temporal feature Fusion (TSTF) module and a Multi-scale Channel-wise attention based Quality Enhancement (MCQE) module. The TSTF module learns both local and global features for correlation modeling; its window-based Transformer and encoder-decoder structure greatly improve execution efficiency. The MCQE module computes multi-scale channel attention, aggregating temporal information across the channels of the feature map at multiple scales and thereby achieving efficient fusion of inter-frame information. Extensive experiments on the JCT-VC test sequences show that the proposed method increases PSNR by up to 0.98 dB at QP = 37. Meanwhile, compared with competing methods at 720p resolution, the inference speed improves by up to 9.4% and the number of FLOPs is reduced by up to 84.4%. Moreover, the proposed method achieves a BD-rate reduction of up to 23.04%.
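The abstract does not detail the TSTF module's internals, but the core idea of window-based attention can be illustrated. The PyTorch sketch below partitions a feature map into non-overlapping windows and applies multi-head self-attention within each window, so the attention cost scales with the window size rather than the full frame. All class and parameter names (WindowSelfAttention, dim, window, heads) are our own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn


class WindowSelfAttention(nn.Module):
    """Multi-head self-attention within non-overlapping spatial windows,
    so attention cost grows with window size rather than frame size."""

    def __init__(self, dim: int, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) fused spatio-temporal features; H and W are
        # assumed divisible by the window size for simplicity.
        b, c, h, w = x.shape
        ws = self.window
        # Partition into (B * num_windows, ws * ws, C) token sequences.
        t = x.view(b, c, h // ws, ws, w // ws, ws)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        # Pre-norm residual attention inside each window.
        n = self.norm(t)
        a, _ = self.attn(n, n, n, need_weights=False)
        t = t + a
        # Reverse the window partition back to (B, C, H, W).
        t = t.view(b, h // ws, w // ws, ws, ws, c)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)


# Example: enhance 64-channel features of a 128x128 frame.
feats = torch.randn(1, 64, 128, 128)
out = WindowSelfAttention(dim=64)(feats)
print(out.shape)  # torch.Size([1, 64, 128, 128])
```

Restricting attention to local windows is what makes a Transformer tractable at 720p and above, which is consistent with the efficiency gains the abstract reports.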
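Similarly, the MCQE module's multi-scale channel attention is only described at a high level. A minimal sketch of one plausible reading follows: channel descriptors are pooled at several spatial scales, turned into per-channel gates by a shared MLP, and averaged. The module name, scale set, and reduction ratio are hypothetical, and the temporal aggregation is assumed to happen by stacking adjacent-frame features along the channel axis before this module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleChannelAttention(nn.Module):
    """Channel gating computed from descriptors pooled at several spatial
    scales, then averaged and applied to the input feature map."""

    def __init__(self, channels: int, scales=(1, 2, 4), reduction: int = 4):
        super().__init__()
        self.scales = scales
        hidden = max(channels // reduction, 8)
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.zeros_like(x)
        for s in self.scales:
            # Pool to an s x s grid: each cell summarizes one region,
            # so each scale sees a different receptive field.
            d = F.adaptive_avg_pool2d(x, s)              # (B, C, s, s)
            g = self.mlp(d.permute(0, 2, 3, 1))          # per-cell gates
            g = g.permute(0, 3, 1, 2)                    # (B, C, s, s)
            gate = gate + F.interpolate(g, size=x.shape[-2:], mode="nearest")
        # Average the per-scale gates and squash to (0, 1).
        return x * torch.sigmoid(gate / len(self.scales))


# Example: gate features formed by stacking two adjacent frames' features
# (32 channels each) along the channel axis -- an assumed fusion scheme.
fused = torch.randn(1, 64, 128, 128)
print(MultiScaleChannelAttention(64)(fused).shape)  # (1, 64, 128, 128)
```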