Boosting 3D Visual Grounding by Object-Centric Referring Network

Published: 01 Jan 2024 · Last Modified: 06 Feb 2025 · IROS 2024 · CC BY-SA 4.0
Abstract: 3D visual grounding aims to locate a specific object within a 3D scene according to a given textual reference. The task is challenging because it requires (1) accurate recognition of the various objects in a 3D scene and (2) understanding of the spatial relations expressed in the description. However, current studies struggle when multiple similar objects are present or when descriptions involve intricate and abstract relations. In this paper, we present a novel, simple, and efficient Object-Centric Referring network, 3D-OCR, which jointly accounts for high-quality semantic representation and deep relation modeling. Specifically, an offline Fine-grained Semantic Enhancement (FSE) module reinforces object-centric semantic awareness with fine-grained, high-quality object semantic representations. To achieve stronger object-centric relation awareness, we propose a Deep Relation Modeling (DRM) module with explicit and implicit relation self-attention, enriching object features with relational context. Moreover, we apply a vision-language contrastive loss to further improve the matching between point cloud and language. Comprehensive experiments on the challenging ScanRefer and Nr3D datasets confirm the strong performance of our method, with improvements of +1.47% on ScanRefer and +1.2% on Nr3D.
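The abstract does not specify the exact form of the vision-language contrastive loss. A minimal sketch of a standard symmetric InfoNCE-style formulation between pooled object and description embeddings might look as follows; the function name, the temperature value, and the assumption that each batch contains matched object-description pairs are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def vision_language_contrastive_loss(obj_feats: torch.Tensor,
                                     text_feats: torch.Tensor,
                                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style contrastive loss between matched
    object (point-cloud) and language embeddings.

    obj_feats:  (B, D) pooled object features, one per referred object
    text_feats: (B, D) pooled description features, aligned row-wise
    """
    # L2-normalize so dot products become cosine similarities
    obj = F.normalize(obj_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs
    logits = obj @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both matching directions, then averaged
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

Under this formulation, each description is pulled toward its referred object's features and pushed away from the other objects in the batch, which is one plausible way such a loss could sharpen point-cloud-to-language matching.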