Keywords: Representation Learning, Language-3D, Self-supervised Learning
Abstract: Recent years have seen significant advancements in large-scale representation learning for 2D vision and language tasks. However, the efficacy of such cross-modal training on large-scale 3D objects paired with other modalities (such as text and images) remains largely unexplored. We introduce MaskCL3D, an efficient and powerful method for language-3D representation learning. By employing contrastive and masked reconstruction learning on 3D point clouds paired with language descriptions and multi-view images, MaskCL3D significantly improves training and testing efficiency. In addition, we collect a large-scale language-3D dataset covering a wide array of objects and descriptions. Models trained on our dataset consistently outperform those trained on alternative datasets. Trained on this new dataset, our approach surpasses existing baselines and achieves state-of-the-art performance on zero-shot classification and retrieval tasks. We perform a series of analytical studies on the learned language and 3D representations and find that they contain rich semantic information, which is crucial for interpreting and correlating intricate concepts within 3D environments.
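The abstract names two training signals: cross-modal contrastive learning and masked reconstruction. As a rough illustration of the contrastive half, the sketch below implements a standard CLIP-style symmetric InfoNCE loss between point-cloud and text embeddings. The function name, temperature value, and embedding dimensions are illustrative assumptions; the submission does not disclose MaskCL3D's actual objective or implementation.

```python
# A minimal sketch of a cross-modal contrastive objective of the kind the
# abstract describes (CLIP-style symmetric InfoNCE between 3D and text
# embeddings). All names and hyperparameters here are assumptions for
# illustration, not MaskCL3D's actual loss.
import torch
import torch.nn.functional as F

def contrastive_loss(pc_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (point cloud, text) embeddings.

    pc_emb, text_emb: (B, D) embeddings from the 3D and language encoders.
    """
    # L2-normalize so dot products are cosine similarities.
    pc_emb = F.normalize(pc_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = pc_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the 3D-to-text and text-to-3D cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Example: a batch of 4 pairs with 512-dimensional embeddings.
loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```

The same pairing scheme extends naturally to the multi-view images mentioned in the abstract by adding analogous loss terms between image embeddings and each of the other two modalities.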
Submission Number: 18