Advancing 3D Object Grounding Beyond a Single 3D Scene

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · License: CC BY 4.0
Abstract: As a widely explored multi-modal task, 3D object grounding aims to localize a unique pre-existing object within a single 3D scene given a natural language description. However, this strict setting is unnatural, as it is not always possible to know whether the target object actually exists in a specific 3D scene. In real-world scenarios, a collection of 3D scenes is generally available, some of which may not contain the described object while others may contain multiple target objects. To this end, we introduce a more realistic setting, named Group-wise 3D Object Grounding, which processes a group of related 3D scenes simultaneously and allows a flexible number of target objects to exist in each scene. Rather than localizing target objects in each scene individually, we argue that ignoring the rich visual information contained in the other related 3D scenes of the same group leads to sub-optimal results. To achieve more accurate localization, we propose a baseline method named GNL3D, a Grouped Neural Listener for 3D grounding in the group-wise setting, which extends the traditional 3D object grounding pipeline with a novel language-guided consensus aggregation and distribution mechanism that explicitly exploits intra-group visual connections. Specifically, building on context-aware spatial-semantic alignment, a Language-guided Consensus Aggregation Module (LCAM) aggregates the visual features of target objects across the scenes in a group to form a visual consensus representation, which is then distributed and injected into a consensus-modulated feature refinement module to refine the visual features, benefiting the subsequent multi-modal reasoning. Furthermore, we design a curriculum strategy that teaches the LCAM, step by step, to extract an effective visual consensus in the presence of negative 3D scenes that contain no target object. To validate the proposed method, we reorganize and enhance the ReferIt3D dataset and propose evaluation metrics to benchmark prior work and GNL3D. Extensive experiments demonstrate that GNL3D achieves state-of-the-art results on both the group-wise setting and the traditional 3D object grounding task.
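To make the aggregation-and-distribution idea concrete, below is a minimal, illustrative PyTorch sketch of one way a language-guided consensus step could work: candidate object features from all scenes in a group are scored against the language feature, pooled into a single consensus vector, and then gated back into each scene's features. This is not the paper's implementation; the class and variable names (`LanguageGuidedConsensus`, `obj_feats`, `lang_feat`), the linear scoring head, and the gated residual distribution are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedConsensus(nn.Module):
    """Illustrative sketch (hypothetical, not the GNL3D code): score each
    candidate object against the language query, softmax over the whole
    scene group, pool a consensus vector, and gate it back into the
    per-scene object features."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)   # fuse object + language features
        self.score = nn.Linear(dim, 1)        # relevance of each fused feature
        self.gate = nn.Linear(2 * dim, dim)   # how much consensus to inject

    def forward(self, obj_feats: torch.Tensor, lang_feat: torch.Tensor):
        # obj_feats: (num_scenes, num_objects, dim) candidate object features
        # lang_feat: (dim,) pooled feature of the language description
        S, O, D = obj_feats.shape
        lang = lang_feat.expand(S, O, D)
        fused = torch.tanh(self.fuse(torch.cat([obj_feats, lang], dim=-1)))
        # Attention over ALL candidates in the group, so that scenes with a
        # clear target dominate the pooled representation.
        weights = F.softmax(self.score(fused).reshape(S * O), dim=0)
        consensus = (weights.reshape(S, O, 1) * obj_feats).sum(dim=(0, 1))
        # Distribution: modulate each object's feature with the consensus.
        ctx = consensus.expand(S, O, D)
        g = torch.sigmoid(self.gate(torch.cat([obj_feats, ctx], dim=-1)))
        return obj_feats + g * ctx, consensus

# Usage on dummy data: a group of 4 related scenes, 32 candidates each.
module = LanguageGuidedConsensus(dim=256)
objs = torch.randn(4, 32, 256)
lang = torch.randn(256)
refined, consensus = module(objs, lang)   # refined: (4, 32, 256)
```

One design point the sketch mirrors from the abstract: the softmax runs across the entire group rather than per scene, so a negative scene (no target object) receives low weights everywhere and contributes little to the consensus, which is what the curriculum strategy is described as teaching the LCAM to handle.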
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Multimodal Fusion, [Content] Vision and Language
Relevance To Conference: 3D object grounding is a widely explored multi-modal task involving the 3D vision and language modalities. In this work, we formalize a novel group-wise setting for this task and present a baseline method named GNL3D that grounds a flexible number of objects across a group of 3D scenes, advancing user-specified 3D object grounding towards more practical multimedia applications.
Supplementary Material: zip
Submission Number: 1792