STAR-VP: Improving Long-term Viewport Prediction in 360° Videos via Space-aligned and Time-varying Fusion

Published: 20 Jul 2024, Last Modified: 30 Jul 2024 | MM 2024 Poster | CC BY 4.0
Abstract: Accurate long-term viewport prediction in tile-based 360° video adaptive streaming helps pre-download tiles further into the future, establishing a longer buffer to cope with network fluctuations. Long-term viewport motion is mainly influenced by the Historical viewpoint Trajectory (HT) and Video Content information (VC). However, HT and VC are difficult to align in space due to their different modalities, and their relative importance in viewport prediction varies across prediction time steps. In this paper, we propose STAR-VP, a model that fuses HT and VC in a Space-aligned and Time-vARying manner for Viewport Prediction. Specifically, we first propose a novel saliency representation $sal_{xyz}$ and a Spatial Attention Module to address the spatial alignment of HT and VC. Then, we propose a two-stage fusion approach based on Transformer and gating mechanisms to capture their time-varying importance. Visualization of attention scores intuitively demonstrates STAR-VP's capability for space-aligned and time-varying fusion. Evaluation on three public datasets shows that STAR-VP achieves state-of-the-art accuracy for long-term (2-5s) viewport prediction without sacrificing short-term ($<$1s) prediction performance.
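To make the two-stage fusion idea above concrete, here is a minimal PyTorch-style sketch (not the authors' released code): a Transformer encoder lets per-timestep HT and VC feature tokens attend to each other, and a learned per-timestep gate then weighs the two modalities, so their relative importance can vary with the prediction horizon. All module names, dimensions, and the exact gating form are illustrative assumptions.

```python
# Hypothetical sketch of Transformer-plus-gating fusion of trajectory (HT)
# and content (VC) features, loosely following the abstract's description.
# Module names, dimensions, and the gating form are assumptions.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_model: int = 128, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        # Stage 1: cross-modal attention over the concatenated HT/VC tokens.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Stage 2: a per-timestep gate trading off HT against VC.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, ht: torch.Tensor, vc: torch.Tensor) -> torch.Tensor:
        # ht, vc: (batch, time, d_model) feature sequences.
        tokens = torch.cat([ht, vc], dim=1)          # (batch, 2*time, d_model)
        fused = self.encoder(tokens)                  # joint attention
        ht_f, vc_f = fused.split(ht.size(1), dim=1)   # split back per modality
        # Gate in [0, 1] varies across time steps, so the balance between
        # trajectory and content can shift with the prediction horizon.
        g = torch.sigmoid(self.gate(torch.cat([ht_f, vc_f], dim=-1)))
        return g * ht_f + (1.0 - g) * vc_f
```

For instance, feeding `ht = torch.randn(8, 15, 128)` and a `vc` tensor of the same shape returns an `(8, 15, 128)` fused sequence, with the gate free to favor trajectory features at near-term steps and content features at far-term ones.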
Relevance To Conference: In this paper, we propose STAR-VP, a long-term viewport prediction model for 360° video streaming. A viewport prediction model with excellent long-term performance helps 360° video streaming systems pre-download tiles further into the future, thereby establishing a longer buffer to cope with network fluctuations. Evaluation on three public datasets shows that STAR-VP achieves state-of-the-art accuracy for long-term (2-5s) viewport prediction without sacrificing short-term (<1s) prediction performance. Specifically, STAR-VP better fuses viewpoint and saliency information in a space-aligned and time-varying manner. We first propose a novel saliency representation $sal_{xyz}$ and a Spatial Attention Module to address the spatial alignment of HT and VC. Then, we propose a two-stage fusion approach based on Transformer and gating mechanisms to capture their time-varying importance. Visualization of attention scores intuitively demonstrates STAR-VP's capability for space-aligned and time-varying fusion.
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Systems] Transport and Delivery, [Experience] Multimedia Applications
Submission Number: 3168