Abstract:
Highlights
• We introduce Text4Point, a novel framework for constructing language-guided 3D point cloud models.
• The key idea is to use 2D images as a bridge connecting the point cloud and language modalities.
• Text4Point uses dense contrastive learning on readily available RGB-D data to align image and point cloud representations.
• We propose a Text Querying Module to integrate language information into 3D representation learning.
• Extensive experiments demonstrate that Text4Point consistently improves performance on various dense prediction tasks.
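The dense contrastive alignment mentioned in the highlights can be illustrated with a minimal sketch. The highlights do not specify the loss, so the symmetric InfoNCE form, the function name `dense_info_nce`, the temperature of 0.07, and the assumption that row `i` of both inputs describes the same 3D location (a correspondence that RGB-D data could supply by projecting depth pixels to points) are all illustrative choices, not the authors' implementation.

```python
import numpy as np


def dense_info_nce(point_feats, pixel_feats, temperature=0.07):
    """Symmetric InfoNCE loss over N matched point/pixel feature pairs.

    Hypothetical helper: both inputs have shape (N, D), and row i of each
    is assumed to describe the same 3D location.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    q = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) compares point i with pixel j.
    logits = (p @ q.T) / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax along each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal, so it is the positive.
        return -np.mean(np.diag(log_prob))

    # Average the point-to-pixel and pixel-to-point directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


# Toy check with synthetic features (not real RGB-D data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
aligned_loss = dense_info_nce(feats, feats)
shuffled_loss = dense_info_nce(feats, np.roll(feats, 1, axis=0))
```

In this toy setup the loss is small when each point feature matches its own pixel feature and grows when the correspondence is shifted by one row, which is the behavior a dense alignment objective of this kind relies on.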