Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

19 Sept 2023 (modified: 25 Mar 2024) | ICLR 2024 Conference Desk Rejected Submission
Keywords: 3D point cloud learning, multi-modality learning, large language model
Abstract: With the growing diversity of large-scale data, multi-modal learning has made notable progress in language and 2D vision. However, in 3D domains, developing an all-purpose multi-modal framework remains under-explored. To this end, we introduce Point-Bind, a 3D multi-modality model that aligns point clouds with 2D images, language, and audio. Guided by ImageBind, we construct a joint embedding space between 3D and multiple modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this joint embedding space, we further present Point-LLM, a 3D large language model (LLM) that follows 3D and multi-modal instructions. Without any 3D instruction data, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA, and exhibits superior 3D and multi-modal question-answering capacity. We have conducted extensive experiments to demonstrate the effectiveness and generalizability of our approach for aligning 3D with multiple modalities.
Supplementary Material: pdf
Primary Area: representation learning for computer vision, audio, language, and other modalities
Submission Number: 1622
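The abstract mentions 3D embedding arithmetic in a joint embedding space. As a rough illustration of that general idea only, the sketch below assumes that embeddings from different modalities live in one shared space, that they are combined by addition after unit normalization, and that retrieval uses cosine similarity. The vectors, the gallery names, and the helper functions `normalize`, `cosine`, `compose`, and `retrieve` are all hypothetical; this page does not specify how Point-Bind actually performs these steps.

```python
import math

def normalize(v):
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def compose(a, b):
    """Assumed form of embedding arithmetic: add unit vectors, then renormalize."""
    return normalize([x + y for x, y in zip(normalize(a), normalize(b))])

def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))

# Toy 3-dimensional embeddings, made up for illustration (not real model outputs).
point_cloud_car = [1.0, 0.0, 0.0]   # hypothetical 3D-encoder output for a car shape
audio_sea_waves = [0.0, 1.0, 0.0]   # hypothetical audio-encoder output for wave sounds
image_gallery = {
    "car_on_road": [0.9, 0.1, 0.1],
    "empty_beach": [0.1, 0.9, 0.1],
    "car_by_sea": [0.7, 0.7, 0.1],
}

# Combine the 3D and audio embeddings, then search the image gallery.
query = compose(point_cloud_car, audio_sea_waves)
best = retrieve(query, image_gallery)
print(best)
```

With these toy vectors, the composed query lies closest to the gallery entry that mixes both concepts, whereas the point-cloud embedding alone would retrieve the car-only image.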