Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

Published: 25 Sept 2024, Last Modified: 20 Dec 2024NeurIPS 2024 posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: 3D grounding, vision and language
Abstract: Multi-object 3D Grounding involves locating 3D boxes based on a given query phrase from a point cloud. It is a challenging and significant task that has numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach that incorporates three innovations. First, a dynamic vision module that enables a variable and learnable number of box proposals. Second, a dynamic camera positioning that extracts features for each proposal. Third, a language-informed spatial attention module that better reasons over the proposals to output the final prediction. Empirically, experiments show that our method outperforms the state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.
Supplementary Material: zip
Primary Area: Machine vision
Submission Number: 2914
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview