Affinity3D: Propagating Instance-Level Semantic Affinity for Zero-Shot Point Cloud Semantic Segmentation
Abstract: Zero-shot point cloud semantic segmentation aims to recognize novel classes at the point level. Previous methods mainly transfer the strong zero-shot generalization capabilities of 2D models from images to point clouds. However, directly transferring knowledge from images to point clouds raises two problems. On the one hand, 2D models may generate incorrect predictions as the image changes. On the other hand, directly mapping 3D points to 2D pixels by perspective projection ignores whether a 3D point is actually visible in the camera view. This incorrect geometric alignment of 3D points and 2D pixels causes semantic ambiguity. To tackle these two problems, we propose a framework named Affinity3D that empowers 3D semantic segmentation models to perceive novel samples. Our framework aggregates instances in 3D and recognizes them in 2D, leveraging the clean geometric separation in 3D and the zero-shot capabilities of 2D models. Affinity3D comprises an affinity module that rectifies incorrect predictions by comparing them with similar instances, and a visibility module that prevents knowledge transfer from visible 2D pixels to 3D points that are invisible in the camera view. Extensive experiments have been conducted on the SemanticKITTI dataset. Our framework achieves state-of-the-art performance in two settings.
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: In this work, we propose a generalized zero-shot 3D semantic segmentation framework that progressively transfers knowledge from images to point clouds. We utilize multimodal data, spanning text, images, and 3D point clouds, to enhance the zero-shot capability of the 3D semantic segmentation network. Because these three modalities have very different data characteristics, a carefully designed training process is required to exploit them jointly. The training framework improves 3D semantic segmentation performance without introducing additional parameters at inference time. Our framework generates more accurate pseudo labels at the instance level than previous methods. The proposed affinity module enhances the quality of pseudo labels by propagating similarity to other samples. The proposed visibility module measures the visibility of 3D points in the camera view by comparing the depth of each point with the depth of its corresponding superpixel. It substantially reduces semantic ambiguity by ignoring invisible 3D points when transferring knowledge from images to point clouds. Our framework improves performance on the SemanticKITTI dataset under both the generalized zero-shot and annotation-free settings, achieving 63.65% hIoU under the generalized zero-shot setting and 18.48% mIoU under the annotation-free setting.
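The visibility test described above (comparing a projected point's depth against its superpixel's depth) can be sketched as follows. This is a minimal illustrative re-implementation, not the paper's code: the function name, the min-depth estimate of superpixel depth, and the `depth_margin` threshold are all assumptions.

```python
import numpy as np

def visibility_mask(point_depths, superpixel_ids, depth_margin=0.5):
    """Flag each projected 3D point as visible or occluded.

    Hypothetical sketch of a depth-based visibility test: a point is kept
    only if its camera-frame depth is close to the depth of the superpixel
    it projects into; points much deeper are treated as occluded.

    point_depths   : (N,) depth of each projected 3D point in the camera frame
    superpixel_ids : (N,) index of the superpixel each point falls into
    depth_margin   : tolerance in the same units as the depths (assumed value)
    """
    point_depths = np.asarray(point_depths, dtype=float)
    superpixel_ids = np.asarray(superpixel_ids)

    # Estimate each superpixel's depth as the minimum point depth inside it:
    # the nearest points are the ones the camera actually sees.
    sp_depth = {sp: point_depths[superpixel_ids == sp].min()
                for sp in np.unique(superpixel_ids)}
    ref = np.array([sp_depth[sp] for sp in superpixel_ids])

    # Points far behind their superpixel's surface are invisible; knowledge
    # from the corresponding 2D pixels should not be transferred to them.
    return point_depths <= ref + depth_margin
```

For example, a point at depth 5.0 projecting into a superpixel whose nearest point lies at depth 2.0 would be marked occluded and excluded from 2D-to-3D knowledge transfer.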
Supplementary Material: zip
Submission Number: 138