LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Published: 01 May 2025, Last Modified: 18 Jun 2025 | ICML 2025 Spotlight Poster | CC BY 4.0
TL;DR: A model that can localize objects in 3D from textual referring expressions.
Abstract: We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model. Code, models, and dataset can be found at the project website: locate3d.atmeta.com
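For readers unfamiliar with masked prediction in latent space, the following minimal PyTorch sketch illustrates the general idea on a featurized point cloud. It is a hedged illustration only: the `PointEncoder` and predictor modules, the zeroing-based masking scheme, and all dimensions are assumptions made for this example and do not reflect the actual 3D-JEPA architecture or training recipe.

```python
# Illustrative JEPA-style pretext task: predict latent features of masked points
# from the latents of visible points. Module shapes and the simple MLP encoder
# are assumptions for this sketch, not the paper's architecture.
import torch
import torch.nn as nn


class PointEncoder(nn.Module):
    """Toy encoder: maps per-point (xyz + lifted 2D foundation-model feature) to a latent."""

    def __init__(self, feat_dim=768, latent_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, xyz, feats):
        # xyz: (B, N, 3); feats: (B, N, feat_dim), e.g. CLIP/DINO features lifted to 3D
        return self.mlp(torch.cat([xyz, feats], dim=-1))  # (B, N, latent_dim)


def masked_latent_prediction_loss(context_encoder, target_encoder, predictor,
                                  xyz, feats, mask_ratio=0.5):
    """Regress target latents at masked points from context latents (latent-space pretext task)."""
    B, N, _ = xyz.shape
    mask = torch.rand(B, N, device=xyz.device) < mask_ratio  # True = point to predict

    # Target latents come from a separate target encoder (typically an EMA copy), no gradients.
    with torch.no_grad():
        target_latents = target_encoder(xyz, feats)

    # Context encoder sees only the unmasked points (masked features zeroed here for simplicity).
    context_latents = context_encoder(xyz, feats * (~mask).unsqueeze(-1))

    # Predictor maps context latents to predicted target latents; loss only on masked points.
    predicted = predictor(context_latents)
    return ((predicted - target_latents) ** 2)[mask].mean()


if __name__ == "__main__":
    encoder = PointEncoder()
    target = PointEncoder()          # in practice an EMA copy of `encoder`
    predictor = nn.Linear(256, 256)  # toy predictor head
    xyz, feats = torch.randn(2, 1024, 3), torch.randn(2, 1024, 768)
    print(masked_latent_prediction_loss(encoder, target, predictor, xyz, feats).item())
```

After this pretraining stage, the abstract describes finetuning the encoder with a language-conditioned decoder that jointly predicts 3D masks and bounding boxes; that supervised stage is not shown in the sketch above.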
Lay Summary: For robot assistants to become commonplace and perform household tasks alongside humans, we will need to communicate with them via natural language. This means the robot will need to differentiate objects based on object names, descriptions, and spatial relationships in order to successfully "put away the pillows on the bed". This work describes a system that can locate objects in 3D space from natural language input. Given the limited amount of 3D data with objects and descriptions, this work uses a three-fold approach. First, we leverage image models to incorporate knowledge from labeled and unlabeled 2D data. Second, we develop a new algorithm to learn from unlabeled 3D data. Finally, we develop a new approach to learn from labeled 3D data. Additionally, we release additional labeled 3D data. This system achieves state-of-the-art performance on existing benchmarks that measure localization accuracy in 3D from natural language descriptions. We also show the system performs well in robotic use cases.
Link To Code: https://github.com/facebookresearch/locate-3d/
Primary Area: Applications->Computer Vision
Keywords: self-supervised learning, object localization, referring expressions, 3D language grounding
Submission Number: 5047