VPFormer: Leveraging Transformer with Voxel Integration for Viewport Prediction in Volumetric Video

Published: 2025 · Last Modified: 25 Mar 2026 · ACM Trans. Multim. Comput. Commun. Appl. 2025 · CC BY-SA 4.0
Abstract: With the continuous advancement of computer vision and image processing technologies, volumetric video, represented by point cloud video, holds the potential for extensive applications in areas such as Virtual Reality (VR) and Augmented Reality (AR). Viewport prediction, also referred to as Field of View (FoV) prediction, is a crucial component of emerging VR and AR applications and plays a vital role in point cloud video transmission. Current viewport prediction models that integrate feature extraction with FoV information rely heavily on the spatial-temporal features extracted by convolutional neural networks. However, 3D convolution cannot effectively capture long-term spatial-temporal dependencies within videos. Moreover, the temporal contrast layer used for temporal feature extraction compares features only within each block, leading to matching errors and inaccurate temporal features, which in turn degrade predictive performance. To address these limitations, we propose a Transformer-based Volumetric Point Cloud Video Viewport Prediction Network (VPFormer) that efficiently extracts spatial-temporal features from point cloud videos. VPFormer is a viewport prediction framework that combines the spatial-temporal features of point cloud videos with user trajectory information. Specifically, we introduce a novel sampling method that preserves spatial-temporal information while reducing computational complexity. We also incorporate context-aware dynamic positional encoding to capture inter-frame spatial-temporal context. We then introduce a voxel-based temporal contrast layer that partitions the point cloud into smaller voxel blocks during feature matching, significantly reducing matching errors and improving the analysis and extraction of temporal features. Finally, by combining the spatial-temporal features of point cloud videos with user head trajectory information, we predict future user viewports. Experimental results demonstrate that this approach outperforms other solutions.
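To illustrate the voxel-based matching idea described in the abstract, below is a minimal sketch, not the paper's implementation, of partitioning two consecutive point cloud frames into voxel blocks and comparing features only within corresponding blocks rather than across the whole cloud. The `voxel_size` value, the centroid-displacement feature, and the helper names are assumptions introduced purely for illustration.

```python
import numpy as np

def voxelize(points, voxel_size=0.1):
    """Group a point cloud of shape (N, 3) into voxel blocks keyed by integer grid index.

    Hypothetical helper: the actual voxel partitioning used by VPFormer is not
    specified in the abstract.
    """
    keys = np.floor(points / voxel_size).astype(np.int64)
    blocks = {}
    for key, p in zip(map(tuple, keys), points):
        blocks.setdefault(key, []).append(p)
    return {k: np.asarray(v) for k, v in blocks.items()}

def voxel_temporal_contrast(frame_t, frame_t1, voxel_size=0.1):
    """Compare a simple per-voxel feature (here: the centroid) only between
    matching voxel blocks of two consecutive frames."""
    blocks_t = voxelize(frame_t, voxel_size)
    blocks_t1 = voxelize(frame_t1, voxel_size)
    contrast = {}
    for key in blocks_t.keys() & blocks_t1.keys():
        # Displacement of the voxel centroid serves as a toy temporal feature.
        contrast[key] = blocks_t1[key].mean(axis=0) - blocks_t[key].mean(axis=0)
    return contrast

# Toy usage with two synthetic frames of 2048 points each.
rng = np.random.default_rng(0)
frame0 = rng.random((2048, 3))
frame1 = frame0 + 0.01 * rng.standard_normal((2048, 3))
print(len(voxel_temporal_contrast(frame0, frame1)), "voxel blocks matched")
```

Restricting comparisons to corresponding voxel blocks is what limits the search space for feature matching; in VPFormer this per-voxel temporal signal is combined with Transformer-extracted spatial-temporal features and user head trajectories to predict the future viewport.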