Keywords: 3D visual grounding
Abstract: The recent development of Large Language Models (LLMs) with strong reasoning abilities has driven research in domains such as mathematics, coding, and scientific reasoning. Meanwhile, 3D visual grounding, an important task in 3D understanding, remains challenging due to the limited reasoning ability of current 3D visual grounding models. Most existing methods combine a text encoder and a visual feature encoder to produce cross-modal fused features and predict the referred object; such models typically require supervised training on large amounts of annotated 3D data. Recent zero-shot visual grounding methods, in turn, require access to a proprietary LLM at test time, which incurs high inference costs. To overcome these limitations, we propose a 3D visual grounding data generation pipeline that synthesizes 3D scenes along with corresponding target queries and ground-truth answers for training visual grounding models. We then leverage the generated visual grounding data to post-train Llama-3.1-8B-Instruct, obtaining a strong 3D visual grounding LLM that outperforms existing state-of-the-art zero-shot methods, demonstrating the effectiveness of our approach.
Submission Number: 2