Keywords: 3D Language Grounding, 2D->3D Feature Distillation, Large-Scale Multi-View Dataset
TL;DR: We propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views and produce crisp 3D feature-clouds. We train and validate our approach on a large-scale synthetic multi-view dataset that we generate and release.
Abstract: Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized due to their impressive capabilities for open-vocabulary grounding in 2D images. Subsequent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy and segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuses features at object level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.
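A minimal sketch (not the authors' released code) of the object-level fusion idea described in the abstract: per-view CLIP feature maps are pooled over instance masks, views whose pooled feature scores poorly against a CLIP text embedding are discarded as uninformative, and the remaining views are fused into one feature per object. Function names, the cosine-similarity view filter, and the similarity-weighted average are illustrative assumptions.

```python
import numpy as np

def fuse_object_features(view_feats, view_masks, text_feat, sim_thresh=0.2):
    """Fuse per-view CLIP feature maps into a single feature for one object.

    view_feats: list of (H, W, D) CLIP feature maps, one per camera view.
    view_masks: list of (H, W) boolean instance masks of the same object.
    text_feat:  (D,) unit-norm CLIP text embedding used to score how
                semantically informative each view is (assumed criterion).
    Returns a (D,) fused object feature, or None if no view passes the filter.
    """
    pooled, weights = [], []
    for feats, mask in zip(view_feats, view_masks):
        if mask.sum() == 0:                      # object not visible in this view
            continue
        obj_feat = feats[mask].mean(axis=0)      # object-level pooling via the mask
        obj_feat /= np.linalg.norm(obj_feat) + 1e-8
        sim = float(obj_feat @ text_feat)        # semantic informativeness score
        if sim < sim_thresh:                     # drop uninformative views
            continue
        pooled.append(obj_feat)
        weights.append(sim)
    if not pooled:
        return None
    weights = np.asarray(weights) / np.sum(weights)
    fused = np.sum(np.stack(pooled) * weights[:, None], axis=0)
    return fused / (np.linalg.norm(fused) + 1e-8)
```

The fused per-object feature can then be copied onto that object's points to form the 3D feature-cloud used as a distillation target; this last step is paraphrased from the abstract, not taken from released code.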
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9773