Qwen-3D: A Generalist 3D Vision-Language Model for Spatial Understanding

Published: 21 May 2026, Last Modified: 21 May 2026
Venue: CVPR 2026 Workshop OpenSUN3D Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Vision, language, and reasoning
Abstract: Recent Large Multimodal Models (LMMs) achieve impressive performance on images and short videos, but long video reasoning remains computationally expensive and temporally inconsistent due to frame-level tokenization and limited context windows. We observe that 3D geometry provides a natural compression mechanism for multi-view visual streams. Geometric signals such as depth and camera pose allow RGB frames to be fused into persistent world-aligned representations, enabling efficient reasoning over space and time. Motivated by this insight, we introduce Qwen3D, a geometry-aware LMM that leverages multi-view geometric signals to compress visual tokens within the Qwen backbone, enabling efficient processing of long and highly redundant video sequences. By modifying visual tokens with 3D Rotary Positional Embeddings, Qwen3D performs attention in world space rather than over independent image frames, enabling efficient cross-view reasoning. Qwen3D further integrates a query-based segmentation decoder that grounds language tokens directly in visual space, allowing unified reasoning across referential grounding, instance segmentation, and visual question answering for both images and videos. Across a broad suite of benchmarks, Qwen3D outperforms large-scale proprietary 2D models and existing 3D LMM approaches. Finally, we show that joint training on both 2D and 3D data preserves strong 2D vision–language capabilities while substantially improving 3D reasoning.
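The abstract does not spell out how depth and camera pose are turned into world-aligned tokens, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes a pinhole camera model, a simple voxel-averaging rule for merging redundant tokens, and a rotary embedding that splits the feature dimension evenly across the x, y, and z axes. All function names (`unproject`, `merge_by_voxel`, `rope3d`) and parameters (`voxel`, `base`) are hypothetical.

```python
import math
from collections import defaultdict

def unproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Back-project pixel (u, v) with metric depth into world coordinates.
    Assumes a pinhole camera; R (3x3 nested lists) and t (length 3)
    form a camera-to-world pose."""
    pc = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    return tuple(sum(R[i][j] * pc[j] for j in range(3)) + t[i] for i in range(3))

def merge_by_voxel(points, feats, voxel=0.5):
    """Assumed compression rule: average all tokens whose world points
    fall into the same voxel, yielding one token per occupied voxel."""
    buckets = defaultdict(list)
    for p, f in zip(points, feats):
        key = tuple(math.floor(c / voxel) for c in p)
        buckets[key].append((p, f))
    merged = []
    for items in buckets.values():
        n = len(items)
        pos = tuple(sum(p[i] for p, _ in items) / n for i in range(3))
        dim = len(items[0][1])
        feat = [sum(f[i] for _, f in items) / n for i in range(dim)]
        merged.append((pos, feat))
    return merged

def rope3d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs by angles proportional to the
    token's world x, y, z. One third of the pairs is assigned to each axis,
    so len(vec) must be divisible by 6."""
    assert len(vec) % 6 == 0
    pairs_per_axis = len(vec) // 6
    out = []
    for axis in range(3):
        for k in range(pairs_per_axis):
            angle = pos[axis] * base ** (-k / pairs_per_axis)
            i = 2 * (axis * pairs_per_axis + k)
            a, b = vec[i], vec[i + 1]
            c, s = math.cos(angle), math.sin(angle)
            out += [a * c - b * s, a * s + b * c]
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))
```

Under these assumptions, the attention logit between a rotated query and a rotated key depends only on the offset between their world positions, not on which frame each token came from. This is one way to read the claim that attention operates in world space rather than over independent image frames.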
Submission Number: 21