Abstract: Current weakly supervised point cloud semantic segmentation methods underutilize the limited annotations available, because unimodal representation learning is hampered by the sparse and textureless nature of point clouds. In this work, we leverage cross-modal information by transferring knowledge from image and text sources to the point cloud network. The intuition is that images contribute rich texture, color, and discriminative cues that complement point clouds and boost semantic segmentation performance. To reduce the heavy computational cost of cross-modality fusion, we introduce Multi-Scale Deformable Knowledge Transfer, a training scheme that extends the rigid one-to-one mapping between multi-modal data to flexible one-to-many relations. Furthermore, we employ pre-trained image-text models to generate pseudo labels for point clouds and to construct positive and negative samples for semantic contrastive regularization, enabling full exploitation of unlabeled data. Experimental results on SemanticKITTI and nuScenes demonstrate substantial improvements, with an average gain of 3.8% over previous weakly supervised methods and performance comparable to fully supervised approaches.
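The semantic contrastive regularization mentioned above can be illustrated with a minimal sketch. Assuming a supervised-contrastive formulation (the exact loss used by the paper is not specified here), points whose pseudo labels agree are treated as positives and all others as negatives; the function name, the NumPy implementation, and the temperature value are illustrative assumptions, not the authors' code.

```python
import numpy as np

def semantic_contrastive_loss(features, pseudo_labels, temperature=0.1):
    """Hypothetical sketch: supervised contrastive loss where points sharing
    a pseudo label are positives and all other points are negatives."""
    # L2-normalize features so the dot product is cosine similarity
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature

    # Positive-pair mask from pseudo labels, excluding self-pairs
    mask_pos = pseudo_labels[:, None] == pseudo_labels[None, :]
    np.fill_diagonal(mask_pos, False)

    # Row-wise log-softmax over all non-self pairs
    logits = sim - sim.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    np.fill_diagonal(exp, 0.0)
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))

    # Average negative log-probability over positive pairs
    pos_counts = mask_pos.sum(axis=1)
    valid = pos_counts > 0
    loss_per_point = -(log_prob * mask_pos).sum(axis=1)[valid] / pos_counts[valid]
    return loss_per_point.mean()
```

Features that cluster by pseudo label yield a lower loss than features whose clusters contradict the labels, which is the behavior the regularizer exploits on unlabeled points.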