OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

18 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Point Cloud Understanding, Open-world understanding, 3D scene understanding, 3D deep learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: OpenIns3D is a powerful, 2D-input-free framework for 3D open-world scene understanding that evolves with 2D models and handles complex queries. SOTA results are achieved on indoor and outdoor benchmarks.
Abstract: Current 3D open-vocabulary scene understanding methods mostly utilize well-aligned 2D images as the bridge to learn 3D features with language. However, applying these approaches becomes challenging in scenarios where 2D images are absent. In this work, we introduce a new pipeline, namely, OpenIns3D, which requires no 2D image inputs, for 3D open-vocabulary scene understanding at the instance level. The OpenIns3D framework employs a “Mask-Snap-Lookup” scheme. The “Mask” module learns class-agnostic mask proposals in 3D point clouds. The “Snap” module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to detect objects of interest. The “Lookup” module searches through the outcomes of “Snap” with the help of Mask2Pixel maps, which contain the precise correspondence between 3D masks and synthetic images, to assign category names to the proposed masks. This 2D-input-free and flexible approach achieves state-of-the-art results on a wide range of indoor and outdoor datasets by a large margin. Moreover, OpenIns3D allows for effortless switching of 2D detectors without re-training. When integrated with powerful 2D open-world models such as ODISE and GroundingDINO, it achieves excellent results on open-vocabulary instance segmentation. When integrated with LLM-powered 2D models like LISA, it demonstrates a remarkable capacity to process highly complex text queries which require intricate reasoning and world knowledge. The code and model will be made publicly available.
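To make the “Mask-Snap-Lookup” flow concrete, the following is a toy sketch of the three stages. Every function here is a hypothetical stand-in, not the paper's implementation: the mask proposals come from a trivial coordinate split rather than a learned network, “Snap” is a single orthographic projection rather than multi-scale synthetic rendering, and the 2D detector is replaced by hand-written boxes. The only idea it illustrates is the Mask2Pixel lookup: project each 3D mask into image space and assign it the label of the detection that covers most of its pixels.

```python
import numpy as np

def mask_module(points, num_masks=2):
    # Stand-in for the learned class-agnostic mask proposal network:
    # split points into `num_masks` groups along x and return boolean masks.
    order = np.argsort(points[:, 0])
    masks = []
    for chunk in np.array_split(order, num_masks):
        m = np.zeros(len(points), dtype=bool)
        m[chunk] = True
        masks.append(m)
    return masks

def snap_module(points, image_size=32):
    # Toy "Snap": orthographically project the cloud onto the xy-plane and
    # record which pixel each point lands in (the Mask2Pixel correspondence).
    xy = points[:, :2]
    mins, maxs = xy.min(0), xy.max(0)
    return ((xy - mins) / (maxs - mins + 1e-8) * (image_size - 1)).astype(int)

def lookup_module(masks, pix, detections):
    # Toy "Lookup": each detection is (label, x0, y0, x1, y1) in pixel space,
    # e.g. from a 2D open-vocabulary detector. A mask gets the label of the
    # box that covers the most of its projected points.
    labels = []
    for m in masks:
        best, best_cover = "unknown", 0
        for label, x0, y0, x1, y1 in detections:
            inside = ((pix[m, 0] >= x0) & (pix[m, 0] <= x1) &
                      (pix[m, 1] >= y0) & (pix[m, 1] <= y1)).sum()
            if inside > best_cover:
                best, best_cover = label, inside
        labels.append(best)
    return labels

# Usage on a synthetic two-object cloud (labels and boxes are made up).
points = np.concatenate([
    np.stack([np.linspace(0.0, 0.4, 10), np.linspace(0, 1, 10), np.zeros(10)], 1),
    np.stack([np.linspace(0.6, 1.0, 10), np.linspace(0, 1, 10), np.zeros(10)], 1),
])
masks = mask_module(points, num_masks=2)
pix = snap_module(points)
labels = lookup_module(masks, pix,
                       [("chair", 0, 0, 15, 31), ("table", 16, 0, 31, 31)])
print(labels)  # ['chair', 'table']
```

The key property being sketched is that the 2D detector is only consulted through the projected images and detection boxes, so it can be swapped (ODISE, GroundingDINO, LISA, or here a hard-coded list) without touching the 3D mask proposals.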
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1445