RODIN: Injecting 2D Foundational Features to 3D Vision Language Understanding

Authors: ICLR 2025 Conference Submission 808 Authors (anonymous)

Published: 14 Sept 2024 (modified: 21 Nov 2024) · ICLR 2025 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: 3D vision-language understanding
Abstract: We present RODIN (Referential ODIN), a novel model for 3D vision-language understanding that operates directly on posed RGB-D frames. Consuming posed RGB-D from sensors, such as an iPhone's, simplifies and speeds up inference compared to existing models, which train and test on point clouds sampled from a dataset-provided reconstructed mesh. We hypothesize that existing approaches consume mesh-sampled point clouds rather than sensor RGB-D point clouds because of inaccurate camera poses in existing 3D grounding benchmarks, and we show that these methods indeed suffer a 5–10% performance drop on 3D referential grounding when given the "sensor" point clouds. Yet sensor noise is unavoidable in real-world settings. RODIN instead addresses it with a scalable, end-to-end architecture for diverse 3D vision-language tasks. Specifically, RODIN starts from powerful 2D weights pretrained on internet-scale data, adapts them into a 2D-3D encoder using the recently proposed ODIN, and combines that backbone with a proposed 3D mask-language decoder based on Mask2Former. RODIN achieves state-of-the-art performance on multiple 3D vision-language benchmarks, including referential grounding (SR3D, NR3D, ScanRefer), language-prompted object detection (ScanNet200 and Matterport3D), and question answering (ScanQA and SQA3D). It outperforms previous methods on 3D vision-language tasks despite consuming only sensor inputs. By effectively leveraging pretrained 2D architectures and finetuning end-to-end on sensor data, RODIN offers a scalable solution for embodied 3D perception.
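To make the input format concrete, below is a minimal sketch (not the authors' code) of the standard pinhole unprojection that lifts one posed RGB-D frame into the world-frame "sensor" point cloud the abstract contrasts with mesh-sampled point clouds. The function name and conventions here (metric depth, OpenCV-style intrinsics, a camera-to-world pose) are illustrative assumptions.

```python
import numpy as np

def unproject_rgbd(depth, rgb, K, cam_to_world):
    """Lift one posed RGB-D frame to a colored world-frame point cloud.

    depth:        (H, W) metric depth in meters (0 = missing)
    rgb:          (H, W, 3) color image
    K:            (3, 3) pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    cam_to_world: (4, 4) camera pose, e.g. from ARKit or COLMAP (assumed inputs)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]                  # back-project through intrinsics
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], -1).reshape(-1, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]  # apply (possibly noisy) pose
    valid = depth.reshape(-1) > 0                    # drop pixels without depth
    return pts_world[valid], rgb.reshape(-1, 3)[valid]
```

Accumulating such per-frame clouds over a scan yields the sensor point cloud; any camera-pose error enters through cam_to_world, which is why mesh-sampled point clouds appear cleaner on benchmarks with imperfect poses.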
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 808