MVP-Net: Multi-View Depth Image Guided Cross-Modal Distillation Network for Point Cloud Upsampling

Published: 20 Jul 2024, Last Modified: 01 Aug 2024 · MM2024 Poster · CC BY 4.0
Abstract: Point cloud upsampling aims to produce a dense and uniform point set from a sparse and irregular one. Current upsampling methods face two main challenges: (i) insufficient uni-modal representations of sparse point clouds, and (ii) inaccurate estimation of the geometric details of dense point clouds, leading to suboptimal upsampling results. To tackle these challenges, we propose MVP-Net, a multi-view depth image guided cross-modal detail estimation and distillation network for point cloud upsampling, in which the multi-view depth images of point clouds are fully exploited to guide upsampling. First, we propose a cross-modal feature extraction module consisting of two branches that extract point features and depth image features separately, producing rich cross-modal representations of sparse point clouds. Next, we design a Multi-View Depth Image to Point Feature Fusion (MVP) block, incorporated into the feature extraction module, to fuse the cross-modal features in a fine-grained and hierarchical manner. Finally, we introduce a paradigm for multi-view depth image guided detail estimation and distillation. The teacher network fully utilizes paired multi-view depth images of sparse point clouds and their dense counterparts to formulate multi-hierarchical representations of geometric details, thereby achieving high-fidelity reconstruction. The student network takes only sparse point clouds and their multi-view depth images as input, and learns to predict the multi-hierarchical detail representations distilled from the teacher. Extensive qualitative and quantitative results on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art point cloud upsampling methods.
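To make the fusion and distillation ideas in the abstract concrete, below is a minimal PyTorch-style sketch. It assumes one pooled feature vector per rendered depth view and fuses it into per-point features via cross-attention, with an L2 feature-matching loss against a frozen teacher across hierarchy levels. All module names, shapes, and the attention-based design are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MVPFusionBlock(nn.Module):
    """Hypothetical MVP-style fusion block: per-point features attend to
    features pooled from V rendered depth views. Names, shapes, and the
    cross-attention design are assumptions for illustration only."""

    def __init__(self, point_dim=128, image_dim=128, num_heads=4):
        super().__init__()
        # Project depth-image features into the point feature space.
        self.img_proj = nn.Linear(image_dim, point_dim)
        # Cross-attention: point features (queries) attend to views (keys/values).
        self.attn = nn.MultiheadAttention(point_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(point_dim, point_dim),
            nn.ReLU(),
            nn.Linear(point_dim, point_dim),
        )
        self.norm1 = nn.LayerNorm(point_dim)
        self.norm2 = nn.LayerNorm(point_dim)

    def forward(self, point_feat, view_feat):
        # point_feat: (B, N, C)     per-point features from the point branch
        # view_feat:  (B, V, C_img) one pooled feature per depth view
        v = self.img_proj(view_feat)            # (B, V, C)
        fused, _ = self.attn(point_feat, v, v)  # points query the views
        x = self.norm1(point_feat + fused)      # residual + norm
        return self.norm2(x + self.ffn(x))      # position-wise refinement


def detail_distillation_loss(student_feats, teacher_feats):
    """L2 feature matching across hierarchy levels; the teacher features are
    detached so gradients flow only into the student."""
    return sum(F.mse_loss(s, t.detach())
               for s, t in zip(student_feats, teacher_feats))


if __name__ == "__main__":
    B, N, V, C = 2, 256, 4, 128
    block = MVPFusionBlock(point_dim=C, image_dim=C)
    out = block(torch.randn(B, N, C), torch.randn(B, V, C))
    print(out.shape)  # torch.Size([2, 256, 128])
```

One plausible reading of the design: cross-attention lets each point weight the depth views that best cover it, which is one way to realize the "fine-grained and hierarchical" fusion the abstract describes when the block is stacked at multiple feature scales.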
Primary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: As a prevalent 3D multimedia data format, point clouds are widely used in many multimedia applications, such as virtual reality and augmented reality. However, raw point clouds captured by 3D sensors often suffer from sparsity, non-uniformity, noise, and outliers. Data in other modalities, such as 2D depth images, are easily obtained alongside 3D point clouds. This work introduces a multi-view depth image guided cross-modal distillation network for point cloud upsampling, contributing to multimedia data processing by incorporating information from 2D depth images to improve 3D point cloud upsampling. It demonstrates that utilizing only the 2D depth images of a sparse 3D point cloud can improve upsampling performance, without requiring additional information about the 3D objects or scenes. Therefore, in multimedia applications such as virtual reality or autonomous driving, this approach can enhance user experience or improve recognition accuracy by converting sparse point clouds into high-fidelity ones with finer geometric details. Overall, this work extends the framework of existing upsampling methods by introducing 2D depth maps to guide point cloud upsampling.
Supplementary Material: zip
Submission Number: 4685