Keywords: 3D language grounding, Pretraining, Training 3D models with 2D supervision
Abstract: We present an approach to training a 3D vision-language mask-grounding model without requiring any 3D supervision.
This is achieved by reconstructing a Gaussian splatting field from the input point cloud and supervising the mask decoder with 2D labels and losses.
This pipeline lets us distill knowledge from powerful 2D foundation models into 3D grounding models, yielding strong performance in both zero-shot and pretraining settings.
We show that this approach outperforms state-of-the-art baselines for 3D vision-language grounding, and also outperforms other 3D pretraining techniques.
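The core idea of supervising a 3D mask model with 2D labels can be sketched as follows. This is a toy illustration, not the paper's implementation: it replaces differentiable Gaussian-splat rendering with crude nearest-pixel splatting, and the function names (`splat_mask_to_2d`, `bce_loss_2d`) and the pinhole-camera setup are hypothetical assumptions.

```python
import numpy as np

def splat_mask_to_2d(points, logits, K, H, W):
    """Project per-point mask logits into a 2D logit map.

    Nearest-pixel splatting stands in for rendering the Gaussian
    splatting field (a simplification, not the paper's renderer).
    points: (N, 3) camera-frame coordinates with z > 0
    logits: (N,) per-point mask logits
    K:      (3, 3) pinhole intrinsics
    """
    uv = (K @ points.T).T              # (N, 3) homogeneous pixel coords
    uv = uv[:, :2] / uv[:, 2:3]        # perspective divide -> (N, 2)
    img = np.full((H, W), -10.0)       # strong "background" logit
    for (u, v), logit in zip(uv, logits):
        x, y = int(round(u)), int(round(v))
        if 0 <= x < W and 0 <= y < H:
            img[y, x] = max(img[y, x], logit)  # keep most confident point
    return img

def bce_loss_2d(pred_logits, target):
    """Binary cross-entropy between the rendered mask and a 2D label,
    i.e. the 2D supervision signal driving the 3D mask decoder."""
    p = 1.0 / (1.0 + np.exp(-pred_logits))
    eps = 1e-7
    return float(-np.mean(target * np.log(p + eps)
                          + (1.0 - target) * np.log(1.0 - p + eps)))
```

In the actual pipeline the 2D labels would come from a 2D foundation model, and gradients of the 2D loss would flow back through the differentiable renderer into the 3D mask decoder.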
Submission Number: 6