SSCCPC-Net: Simultaneously Learning 2D and 3D Features with CLIP for Semantic Scene Completion on Point Cloud

Published: 2024 · Last Modified: 13 Nov 2025 · CGI (3) 2024 · CC BY-SA 4.0
Abstract: Compared to traditional Scene Completion (SC), Semantic Scene Completion (SSC) is a challenging task that aims to generate a complete and semantically consistent 3D scene from partial and sparse input data, which is fundamental to fully understanding the scene and interacting with it. Consequently, the SSC task has received much attention in recent years. Most methods are voxel-based, but these have high computational and memory requirements. The few point cloud-based works do not sufficiently exploit the correlation between the semantic segmentation and geometric completion subtasks; they also focus too heavily on point cloud shape features while ignoring the rich texture information that RGB images can provide. In this paper, we present SSCCPC-Net (Semantic Scene Completion with CLIP on Point Cloud-Net), a novel network architecture for point cloud semantic scene completion that combines 2D and 3D features. Inspired by recent applications of large pretrained vision-language models to semantic segmentation, we explore accomplishing the SSC task with the help of the Contrastive Language-Image Pre-Training (CLIP) model. Specifically, we use CLIP features as guidance to fuse the 2D features extracted from the RGB image with the 3D features extracted from the point cloud. The fused features are then fed into our Semantic-Completion Decoder for per-point semantic prediction and semantic labeling-assisted point cloud completion. Finally, we obtain the semantically complete point cloud. Extensive experiments demonstrate that our method achieves higher effectiveness and generalizability than state-of-the-art methods.
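The abstract does not specify how the CLIP-guided fusion of 2D image features and 3D point features is realized. The sketch below illustrates one plausible mechanism, a channel-wise gate computed from a global CLIP image embedding that blends per-point 2D (texture) and 3D (shape) features; all dimensions, weight names, and the gating scheme are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical feature dimensions: 2D features, 3D features, CLIP embedding, output.
D2, D3, DC, DO = 64, 64, 512, 128
W2 = rng.normal(size=(D2, DO)) * 0.05   # projection for per-point 2D (texture) features
W3 = rng.normal(size=(D3, DO)) * 0.05   # projection for per-point 3D (shape) features
Wg = rng.normal(size=(DC, DO)) * 0.05   # maps the CLIP embedding to a channel gate

def clip_guided_fusion(f2d, f3d, clip_feat):
    """Blend 2D and 3D features per channel; the CLIP embedding decides
    how much image texture vs. point-cloud shape to trust (assumed design)."""
    g = sigmoid(clip_feat @ Wg)[:, None, :]          # gate, shape (B, 1, DO)
    return g * (f2d @ W2) + (1.0 - g) * (f3d @ W3)   # fused, shape (B, N, DO)

f2d = rng.normal(size=(2, 1024, D2))   # per-point 2D features lifted from the RGB image
f3d = rng.normal(size=(2, 1024, D3))   # per-point 3D features from a point encoder
clip_feat = rng.normal(size=(2, DC))   # global CLIP image embedding
fused = clip_guided_fusion(f2d, f3d, clip_feat)
print(fused.shape)  # (2, 1024, 128)
```

In a real pipeline the fused `(B, N, DO)` tensor would feed the decoder for per-point semantic prediction and completion; a learned attention over CLIP text embeddings would be an equally plausible alternative to this simple gate.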