MVP-Net: Multi-View Depth Image Guided Cross-Modal Distillation Network for Point Cloud Upsampling

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: Point cloud upsampling aims to produce a dense and uniform point set from a sparse and irregular one. Current upsampling methods face two primary challenges: (i) insufficient uni-modal representations of sparse point clouds, and (ii) inaccurate estimation of geometric details in dense point clouds, both of which lead to suboptimal upsampling results. To tackle these challenges, we propose MVP-Net, a multi-view depth image guided cross-modal detail estimation distillation network for point cloud upsampling, in which the multi-view depth images of point clouds are fully exploited to guide upsampling. First, we propose a cross-modal feature extraction module consisting of two branches that extract point features and depth image features separately, producing sufficient cross-modal representations of sparse point clouds. Next, we design a Multi-View Depth Image to Point Feature Fusion (MVP) block to fuse the cross-modal features in a fine-grained and hierarchical manner; the MVP block is incorporated into the feature extraction module. Finally, we introduce a paradigm for multi-view depth image-guided detail estimation and distillation. The teacher network fully utilizes paired multi-view depth images of sparse point clouds and their dense counterparts to formulate multi-hierarchical representations of geometric details, thereby achieving high-fidelity reconstruction. Meanwhile, the student network takes only sparse point clouds and their multi-view depth images as input, and learns to predict the multi-hierarchical detail representations distilled from the teacher network. Extensive qualitative and quantitative results on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art point cloud upsampling methods.
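The teacher-student scheme described above can be sketched as a per-level regression objective: the student's predicted detail features at each hierarchy level are pulled toward the teacher's. This is a minimal illustrative sketch, not the authors' implementation; the function names, feature shapes, and level weights are all assumptions.

```python
# Hedged sketch of multi-hierarchical detail distillation (illustrative only):
# the student regresses its per-level detail features onto the teacher's.

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(student_feats, teacher_feats, level_weights=None):
    """Weighted sum of per-level MSEs between student and teacher
    detail representations (one feature vector per hierarchy level,
    coarse to fine). Weights default to 1.0 for every level."""
    if level_weights is None:
        level_weights = [1.0] * len(teacher_feats)
    return sum(w * mse(s, t)
               for w, s, t in zip(level_weights, student_feats, teacher_feats))

# Toy example with two hierarchy levels.
teacher = [[0.0, 1.0], [2.0, 2.0]]
student = [[0.5, 1.0], [2.0, 1.0]]
print(distill_loss(student, teacher))  # -> 0.625 (0.125 + 0.5)
```

In practice each level's features would come from the fused point/depth-image branches, and the distillation term would be combined with a reconstruction loss on the upsampled points.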