Keywords: 3D Visual Grounding, LLM, VLM, Zero-Shot Learning
TL;DR: SeeGround is a VLM-assisted framework for zero-shot open-vocabulary 3D visual grounding.
Abstract: 3D Visual Grounding (3DVG) aims to localize objects in 3D scenes based on textual descriptions, enabling applications in augmented reality and robotics. Traditional approaches rely on annotated 3D datasets and predefined object categories, which limits scalability. We introduce SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs). To bridge 3D scenes and the 2D inputs expected by VLMs, we use a hybrid representation of query-aligned rendered images and spatially enriched text. SeeGround introduces two key modules: a Perspective Adaptation Module for dynamic viewpoint selection, and a Fusion Alignment Module that aligns visual and spatial features to improve localization. Extensive experiments on ScanRefer and Nr3D show that our approach outperforms existing zero-shot methods by large margins, surpassing the previous state of the art by 7.7% on ScanRefer and 7.1% on Nr3D. Notably, it also exceeds weakly supervised methods and rivals some fully supervised ones, demonstrating its effectiveness on complex 3DVG tasks. The code will be made publicly available.
Submission Number: 4
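For intuition only, the following is a minimal sketch of the zero-shot pipeline outlined in the abstract: pick a query-aware viewpoint, render the scene from it, serialize object positions into spatially enriched text, and query a 2D VLM. All names here (`select_viewpoint`, `render_view`, `spatial_text`, `ask_vlm`) are hypothetical placeholders, not functions from the SeeGround codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative stand-ins; none of these names come from the paper or its code.

@dataclass
class Object3D:
    label: str
    center: Tuple[float, float, float]  # (x, y, z) in scene coordinates

def select_viewpoint(query: str, objects: List[Object3D]) -> Tuple[float, float, float]:
    """Crude stand-in for query-aware viewpoint selection (Perspective Adaptation
    Module in the paper): look toward objects whose labels appear in the query."""
    anchors = [o for o in objects if o.label in query] or objects
    cx = sum(o.center[0] for o in anchors) / len(anchors)
    cy = sum(o.center[1] for o in anchors) / len(anchors)
    return (cx, cy - 2.0, 1.5)  # step back from the anchors at eye height

def render_view(viewpoint: Tuple[float, float, float]) -> bytes:
    """Placeholder renderer; a real system would return an image of the 3D scene."""
    return b""

def spatial_text(objects: List[Object3D]) -> str:
    """Serialize object labels and 3D positions into text for the VLM prompt."""
    return "; ".join(f"{o.label} at {o.center}" for o in objects)

def ask_vlm(image: bytes, prompt: str) -> str:
    """Placeholder for a 2D VLM call that returns the referred object."""
    return "object_id=0"

def ground(query: str, objects: List[Object3D]) -> str:
    view = select_viewpoint(query, objects)
    image = render_view(view)
    prompt = f"Scene objects: {spatial_text(objects)}\nQuery: {query}"
    return ask_vlm(image, prompt)

if __name__ == "__main__":
    scene = [Object3D("chair", (1.0, 2.0, 0.5)), Object3D("table", (2.0, 2.5, 0.7))]
    print(ground("the chair next to the table", scene))
```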