From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We use differentiable rendering to train 3D language grounding models.
Abstract: 3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes--a six-order-of-magnitude gap that severely limits performance. We introduce \textbf{\emph{LIFT-GS}}, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25.7\% mAP on open-vocabulary instance segmentation (vs. 20.2\% prior SOTA) and consistent 10-30\% improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2×, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: \url{https://liftgs.github.io}.
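To make the render-supervised formulation concrete, below is a minimal, self-contained PyTorch sketch; it is not the authors' implementation. LIFT-GS renders predicted 3D Gaussians with differentiable Gaussian splatting, whereas this sketch substitutes a crude per-point projection, and a random tensor stands in for a 2D pseudo-mask from a foundation model such as SAM. The helpers `splat_to_image` and `render_supervised_loss`, and all shapes, are illustrative assumptions. The point it illustrates is that the 2D loss backpropagates through the renderer into the 3D mask predictions, so no 3D annotations are required.

```python
# Minimal PyTorch sketch of render-supervised distillation (illustrative only,
# not the LIFT-GS implementation). A crude per-point splat stands in for
# differentiable Gaussian splatting; a random tensor stands in for a SAM mask.
import torch
import torch.nn.functional as F

def splat_to_image(xyz, logits, K, Rt, H, W):
    """Project per-point mask logits into a 2D logit map (hypothetical helper).

    xyz: (N, 3) world-space centers, logits: (N,), K: (3, 3) intrinsics,
    Rt: (3, 4) world-to-camera extrinsics. Each point deposits its logit
    into the pixel it projects to; a pixel's value is the average logit.
    """
    cam = (Rt[:, :3] @ xyz.T + Rt[:, 3:]).T               # world -> camera
    uvw = (K @ cam.T).T                                    # camera -> homogeneous pixels
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)
    u = uv[:, 0].round().long().clamp(0, W - 1)
    v = uv[:, 1].round().long().clamp(0, H - 1)
    idx = v * W + u
    sums = torch.zeros(H * W, device=xyz.device).scatter_add(0, idx, logits)
    counts = torch.zeros(H * W, device=xyz.device).scatter_add(0, idx, torch.ones_like(logits))
    empty = torch.full_like(sums, -10.0)                   # pixels hit by no point -> background
    return torch.where(counts > 0, sums / counts.clamp(min=1.0), empty).view(H, W)

def render_supervised_loss(xyz, mask_logits, K, Rt, pseudo_mask_2d):
    """BCE between the rendered 3D mask and a 2D pseudo-mask (e.g., from SAM)."""
    H, W = pseudo_mask_2d.shape
    rendered = splat_to_image(xyz, mask_logits, K, Rt, H, W)
    return F.binary_cross_entropy_with_logits(rendered, pseudo_mask_2d)

# Toy usage: random stand-ins for the model's predictions and the 2D pseudo-label.
xyz = torch.randn(1024, 3) + torch.tensor([0.0, 0.0, 4.0])  # points in front of the camera
mask_logits = torch.randn(1024, requires_grad=True)         # predicted 3D mask logits
K = torch.tensor([[128.0, 0.0, 64.0], [0.0, 128.0, 64.0], [0.0, 0.0, 1.0]])
Rt = torch.cat([torch.eye(3), torch.zeros(3, 1)], dim=1)    # identity camera pose
pseudo_mask_2d = (torch.rand(128, 128) > 0.5).float()       # stand-in for a SAM mask
loss = render_supervised_loss(xyz, mask_logits, K, Rt, pseudo_mask_2d)
loss.backward()  # 2D supervision backpropagates into the 3D mask predictions
```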
Lay Summary: 3D computer vision models, which help machines understand and interact with our 3D world, currently suffer from a severe data shortage. Unlike 2D models that can learn from billions of images, 3D models often rely on only thousands of labeled examples, making it difficult for them to perform well. Our work introduces a new method, called LIFT-GS, that uses 2D foundation models (such as SAM, CLIP, and LLaMA) to teach 3D models without needing expensive 3D labels. We do this by having our 3D model generate simple 3D primitives (called Gaussians) and then render them into 2D images that existing 2D models can supervise. This technique effectively lets the 3D model learn from the vast knowledge already stored in 2D models. With this approach, we achieved a significant boost in performance on challenging 3D tasks, such as instance segmentation and referential grounding. Our method also shows that pretraining with supervision from 2D models effectively doubles the size of the 3D training data, a big step forward in overcoming the data bottleneck in 3D vision.
Primary Area: Applications->Computer Vision
Keywords: 3D language grounding, Pretraining, Training 3D models using 2D supervision
Submission Number: 3875