Abstract: Cross-modal few-shot 2D–3D point cloud semantic segmentation provides a practical way to transfer mature 2D domain knowledge into 3D segmentation models, reducing the reliance on laborious 3D annotation and improving generalization to new categories. However, previous methods rely on single-view point cloud generation algorithms to bridge the gap between 2D images and 3D point clouds, leaving the geometry of an object or scene incomplete due to occlusions. To address this issue, we propose a novel view-synthesis cross-modal few-shot point cloud semantic segmentation network. It introduces color and depth inpainting to generate multi-view images and masks, which compensate for the depth information missing from the generated point clouds. Additionally, we propose a cross-modal embedding network to bridge the domain gap between the synthesized and the originally collected 3D data, and employ a weighted prototype network to balance the impact of multi-view images and enhance segmentation performance. Extensive experiments on two benchmarks demonstrate the superiority of our method, which outperforms existing cross-modal few-shot 3D segmentation methods.
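To make the weighted prototype idea concrete, below is a minimal sketch (not the authors' implementation) of weighted multi-view prototype aggregation for few-shot point cloud segmentation. The tensor shapes, the masked average pooling, and the similarity-based view weighting are illustrative assumptions; the paper's actual network may weight views and compute prototypes differently.

```python
# Hypothetical sketch of weighted multi-view prototype fusion; all names,
# shapes, and the weighting scheme are assumptions, not the paper's method.
import torch
import torch.nn.functional as F


def masked_average_pooling(feats, masks):
    """feats: (V, C, N) per-view point features; masks: (V, N) binary masks.
    Returns one class prototype per view, shape (V, C)."""
    masks = masks.unsqueeze(1)                      # (V, 1, N)
    summed = (feats * masks).sum(dim=-1)            # (V, C)
    counts = masks.sum(dim=-1).clamp(min=1e-6)      # (V, 1)
    return summed / counts


def weighted_prototype(view_protos, query_feat):
    """Fuse per-view prototypes, weighting each view by the cosine similarity
    between its prototype and the query's global feature (one plausible choice)."""
    query_global = query_feat.mean(dim=-1)          # (C,)
    sims = F.cosine_similarity(view_protos, query_global.unsqueeze(0), dim=1)
    weights = torch.softmax(sims, dim=0)            # (V,)
    return (weights.unsqueeze(1) * view_protos).sum(dim=0)  # (C,)


def segment_query(query_feat, fg_proto, bg_proto, tau=0.1):
    """Assign each query point a probability via similarity to fg/bg prototypes."""
    protos = torch.stack([bg_proto, fg_proto])      # (2, C)
    sims = F.cosine_similarity(
        query_feat.unsqueeze(0), protos.unsqueeze(-1), dim=1)  # (2, N)
    return torch.softmax(sims / tau, dim=0)         # per-point class probabilities


# Toy usage: 3 synthesized views, 64-dim features, 1024 points per cloud.
V, C, N = 3, 64, 1024
support_feats = torch.randn(V, C, N)
support_masks = (torch.rand(V, N) > 0.5).float()
query_feat = torch.randn(C, N)

fg = weighted_prototype(masked_average_pooling(support_feats, support_masks), query_feat)
bg = weighted_prototype(masked_average_pooling(support_feats, 1 - support_masks), query_feat)
probs = segment_query(query_feat, fg, bg)           # (2, N)
```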
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work targets cross-modal 3D point cloud semantic segmentation by fusing 2D modality information. It is built on few-shot learning, which significantly reduces laborious 3D annotation work and improves generalization to new categories. In addition, it employs view synthesis to compensate for incomplete geometry, which enhances segmentation performance compared to previous methods.
Submission Number: 3935