BoxGrounder: 3D Visual Grounding Using Object Size Estimates

Published: 01 Jan 2024 · Last Modified: 07 Nov 2025 · IEEE Robotics Autom. Lett. 2024 · CC BY-SA 4.0
Abstract: Recent advances in simultaneous localization and mapping (SLAM) systems have significantly eased the creation of 3D digital replicas of real-world environments. Many applications that build on these digital twins require object-level annotations, which are challenging to acquire. This is especially true in scenes containing novel objects, for which pre-trained detection models are unavailable. In this letter, we address this issue by introducing BoxGrounder, a method for grounding objects in 3D point clouds using object size estimates. Specifically, we propose increasing the user's involvement during data acquisition by concurrently recording speech segments that contain the user's estimates of the size of relevant objects in the scene. We formulate the grounding problem as a region-growing scheme over geometric primitives that takes the user's estimate into account. Our approach improves upon the baseline methods on our real-world dataset as well as on ScanNet, and we provide evidence for the practical feasibility of our method through a user study.
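The abstract only sketches the grounding formulation; the snippet below is a minimal illustrative sketch, not the paper's actual algorithm. It assumes primitives are given as point sets with a precomputed adjacency graph, and greedily grows a region while scoring it by how well its bounding-box extent matches the user's spoken size estimate. All names (`grow_region`, `size_score`) and the greedy stopping rule are hypothetical.

```python
import numpy as np

def bbox_extent(points):
    """Axis-aligned bounding-box edge lengths of an (N, 3) point set."""
    return points.max(axis=0) - points.min(axis=0)

def size_score(points, size_estimate):
    """Lower is better: deviation of the region's bounding box from the
    user's size estimate, e.g. [0.6, 0.6, 1.1] metres (order-insensitive)."""
    return np.abs(np.sort(bbox_extent(points)) - np.sort(size_estimate)).sum()

def grow_region(seed, primitives, adjacency, size_estimate, max_steps=50):
    """Greedy region growing over geometric primitives.

    primitives: list of (M_i, 3) point arrays, one per primitive
    adjacency:  dict mapping primitive index -> iterable of neighbour indices
    Returns the set of primitive indices whose union best matches the
    size estimate, together with its score.
    """
    region = {seed}
    points = primitives[seed]
    best_region, best_score = set(region), size_score(points, size_estimate)

    for _ in range(max_steps):
        frontier = {n for i in region for n in adjacency[i]} - region
        if not frontier:
            break
        # Add the neighbour that most improves agreement with the size estimate.
        score, nxt = min(
            (size_score(np.vstack([points, primitives[n]]), size_estimate), n)
            for n in frontier
        )
        if score >= best_score:
            break  # growing further only worsens the fit
        region.add(nxt)
        points = np.vstack([points, primitives[nxt]])
        best_region, best_score = set(region), score

    return best_region, best_score

# Grounding then amounts to trying each primitive as a seed and keeping the
# region with the lowest score:
#   best = min((grow_region(s, prims, adj, est) for s in range(len(prims))),
#              key=lambda r: r[1])
```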