Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Natural Language; 3D Perception; Cross-modality Knowledge
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Typical 3D perception approaches learn a well-performing network through supervised training or pretraining followed by finetuning. Either way, they explore only in-modality solutions and data. In this work, we introduce a cross-modal strategy that applies pretrained language models to understanding 3D point clouds, given that both point clouds and texts are discrete data. The language model is trained on a language corpus and kept frozen. We propose a simple yet effective approach, named LAMP (LAnguage Models can read Point clouds), which trains only a small portion of parameters to align the data distribution of 3D point clouds with pretrained language models and to spark the 3D perception ability of language models. Furthermore, we use the resulting 3D-aware language model to extract features of point clouds and texts simultaneously, which mitigates the modality gap and boosts performance on multimodal tasks, e.g., 3D visual grounding. Extensive experiments on unimodal and multimodal tasks validate the effectiveness of the proposed method.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1936