Fine-grained text and image guided point cloud completion with CLIP model

Published: 01 Jan 2025 · Last Modified: 18 Oct 2025 · Neurocomputing 2025 · CC BY-SA 4.0
Abstract: In this work, we propose a novel multimodal fusion network for point cloud completion that simultaneously leverages visual and textual information to predict the semantic and geometric characteristics of incomplete shapes with better generalization. Specifically, to overcome the limited prior knowledge available from small-scale point cloud datasets, we utilize a vision-language model pre-trained on a large dataset of image-text pairs, whose textual and visual encoders consequently exhibit stronger generalization ability. We then introduce a multi-stage feature fusion strategy that progressively incorporates textual and visual features into the backbone network. To further explore the effectiveness of fine-grained text descriptions, we build a text corpus with fine-grained descriptions that provide richer geometric details for 3D shapes; these detailed descriptions are used to train and evaluate our network. Extensive quantitative and qualitative experiments demonstrate the superior performance of our method compared to state-of-the-art point cloud completion networks. Code is available at https://github.com/songwei100110/FTPNet.
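To make the described pipeline concrete, below is a minimal sketch (not the authors' released code) of the core idea: frozen CLIP encoders supply a global text (or image) feature, and a small cross-attention module injects it into per-point backbone features at one of several fusion stages. The module name `CrossModalFusion`, the feature dimensions, and the cross-attention design are illustrative assumptions, not details taken from the paper.

```python
# Sketch of CLIP-guided feature fusion for point cloud completion.
# Assumptions: the fusion module design, dimensions, and names are hypothetical.
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git


class CrossModalFusion(nn.Module):
    """Fuse a global CLIP feature into per-point features via cross-attention."""

    def __init__(self, point_dim=256, clip_dim=512, heads=4):
        super().__init__()
        self.proj = nn.Linear(clip_dim, point_dim)   # align CLIP dim to backbone dim
        self.attn = nn.MultiheadAttention(point_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(point_dim)

    def forward(self, point_feats, clip_feat):
        # point_feats: (B, N, point_dim); clip_feat: (B, clip_dim)
        kv = self.proj(clip_feat).unsqueeze(1)       # (B, 1, point_dim) key/value token
        fused, _ = self.attn(point_feats, kv, kv)    # points attend to the CLIP token
        return self.norm(point_feats + fused)        # residual update


device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP encoders stay frozen; only fusion/backbone would be trained

# A fine-grained description, in the spirit of the paper's text corpus (example text).
tokens = clip.tokenize(
    ["a wooden chair with four straight legs and a tall slatted back"]
).to(device)
with torch.no_grad():
    text_feat = clip_model.encode_text(tokens).float()  # (1, 512)

fusion = CrossModalFusion().to(device)
point_feats = torch.randn(1, 2048, 256, device=device)  # stand-in backbone features
point_feats = fusion(point_feats, text_feat)            # one fusion stage of several
```

An image branch would work the same way, substituting `clip_model.encode_image(preprocess(img).unsqueeze(0))` for the text feature; repeating such a fusion block at multiple backbone depths mirrors the multi-stage strategy described in the abstract.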