Unifying 2D and 3D Vision-Language Understanding

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
TL;DR: UniVLG unifies 2D and 3D vision-language tasks, transferring 2D knowledge to the 3D domain.
Abstract: Progress in 3D vision-language learning has been hindered by the scarcity of large-scale 3D datasets. We introduce UniVLG, a unified architecture for 2D and 3D vision-language understanding that bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems. Our approach initializes most model weights from pre-trained 2D models and trains on both 2D and 3D vision-language data. We propose a novel language-conditioned mask decoder shared across 2D and 3D modalities to ground objects effectively in both RGB and RGB-D images, outperforming box-based approaches. To further reduce the domain gap between 2D and 3D, we incorporate 2D-to-3D lifting strategies, enabling UniVLG to utilize 2D data to enhance 3D performance. With these innovations, our model achieves state-of-the-art performance across multiple 3D vision-language grounding tasks, demonstrating the potential of transferring advances from 2D vision-language learning to the data-constrained 3D domain. Furthermore, co-training on both 2D and 3D data enhances performance across modalities without sacrificing 2D capabilities. By removing the reliance on 3D mesh reconstruction and ground-truth object proposals, UniVLG sets a new standard for realistic, embodied-aligned evaluation. Code and additional visualizations are available at https://univlg.github.io.
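To make the 2D-to-3D lifting idea concrete, below is a minimal illustrative sketch (not the authors' released implementation) of how per-pixel features from a pre-trained 2D backbone can be unprojected into a 3D point-feature cloud using sensor depth and camera parameters. The function name, tensor shapes, and argument conventions are assumptions for illustration only; see the linked repository for the actual code.

```python
import torch


def lift_2d_to_3d(feat_2d, depth, intrinsics, cam_to_world):
    """Unproject per-pixel 2D features into a 3D point-feature cloud.

    feat_2d:      (B, C, H, W) feature map from a pre-trained 2D backbone
    depth:        (B, H, W) metric depth from the RGB-D sensor
    intrinsics:   (B, 3, 3) camera intrinsic matrix K
    cam_to_world: (B, 4, 4) camera-to-world extrinsics
    Returns points (B, H*W, 3) and features (B, H*W, C).
    """
    B, C, H, W = feat_2d.shape
    device = feat_2d.device

    # Pixel grid in homogeneous coordinates (u, v, 1).
    v, u = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)  # (3, H*W)

    # Back-project: X_cam = depth * K^-1 [u, v, 1]^T
    K_inv = torch.inverse(intrinsics)                 # (B, 3, 3)
    rays = K_inv @ pix.unsqueeze(0)                   # (B, 3, H*W)
    pts_cam = rays * depth.reshape(B, 1, -1)          # (B, 3, H*W)

    # Transform camera-frame points into world coordinates.
    pts_h = torch.cat([pts_cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    pts_world = (cam_to_world @ pts_h)[:, :3]         # (B, 3, H*W)

    points = pts_world.transpose(1, 2)                # (B, H*W, 3)
    features = feat_2d.reshape(B, C, -1).transpose(1, 2)  # (B, H*W, C)
    return points, features
```

With a lifting step of this kind, 2D images (paired with estimated or sensed depth) and native RGB-D data can be mapped into a shared 3D representation, which is what allows a single language-conditioned mask decoder to be trained jointly on both modalities.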
Lay Summary: Most real-world robots use 3D sensors but still rely on models trained only on 2D images, missing out on the full benefits of 3D perception. This is mainly because high-quality 3D training data is scarce and expensive. We introduce UniVLG, a new vision-language model that combines 2D and 3D data to bridge this gap. UniVLG uses powerful pre-trained 2D models and learns to understand 3D scenes by aligning 2D and 3D inputs. A key innovation is a language-guided mask decoder that accurately grounds objects in 3D space. Our model outperforms previous methods on major benchmarks while working in more realistic, sensor-based settings. This shows that leveraging 2D data is a practical and effective way to boost 3D understanding in embodied AI systems.
Link To Code: https://github.com/facebookresearch/univlg
Primary Area: Applications->Computer Vision
Keywords: 3D Language grounding
Submission Number: 2942