MAESR360: Masked autoencoder-based 360-degree video streaming via multi-scale feature fusion

Published: 01 Jan 2024, Last Modified: 06 Mar 2025, VCIP 2024, CC BY-SA 4.0
Abstract: 360-degree video streaming is becoming increasingly popular for its immersive experience. Traditional adaptive tile-based streaming methods allocate bitrates according to viewport prediction, which effectively reduces the required transmission bandwidth but causes serious quality degradation when the viewport prediction is inaccurate. Thus, some researchers have proposed 360-degree video streaming frameworks based on visual reconstruction and enhancement, which can reconstruct the whole frame at very low bitrates. However, existing frameworks are built upon image-based visual reconstruction methods, which do not fully consider the characteristics of videos. In this paper, we propose a masked autoencoder-based, multi-scale optimized framework for 360-degree video streaming (MAESR360), which fully exploits the temporal relevance of video. We utilize spatio-temporal downsampling and high-ratio tube masking strategies to effectively reduce the amount of transmitted data. Additionally, we design a lightweight visual reconstruction model based on multi-scale feature fusion to recover the visual quality of video frames. The effectiveness of our proposed method is demonstrated through extensive experiments.
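To illustrate the tube-masking idea mentioned in the abstract, the following minimal sketch masks the same spatial patch positions across every frame of a clip at a high ratio. The function name, patch-grid size, and masking ratio below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio=0.9, seed=0):
    """Sample one spatial mask and repeat it across all frames ("tube" masking).

    Returns a boolean array of shape (num_frames, num_patches) where True
    marks patch tokens that are dropped before transmission.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    spatial_mask = np.zeros(num_patches, dtype=bool)
    spatial_mask[rng.choice(num_patches, num_masked, replace=False)] = True
    # The same spatial positions are masked in every frame, forming "tubes"
    # along the temporal axis, so reconstruction must rely on temporal context
    # rather than copying the same patch from a neighboring frame.
    return np.broadcast_to(spatial_mask, (num_frames, num_patches)).copy()

# Example: a 16-frame clip with a 14x14 patch grid and 90% of tubes masked.
mask = tube_mask(num_frames=16, num_patches=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (16, 196) ~0.90
```

In a streaming setting along these lines, only the unmasked tokens would be encoded and transmitted, and a decoder-side reconstruction model would infer the masked tubes from the visible spatio-temporal context.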