Visual Geometry Transformer in the Wild: Distractor-Free 3D Reconstruction

10 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: 3D reconstruction
Abstract: Current end-to-end multi-view 3D reconstruction methods achieve impressive results but are built on a restrictive assumption: the scene is entirely static with dense correspondence. This reliance on idealized inputs causes even the most advanced methods to fail in real-world settings, where transient distractors and occlusions are present. To address this, we propose Visual Geometry Transformer in the Wild (VGTW), an end-to-end framework for robust reconstruction from inconsistent views. The core idea is to isolate and suppress distractor-affected regions while preserving the components that are consistent across views. Specifically, we introduce a distractor-aware training strategy that separates clean features from distractor-contaminated ones in the attention mechanism while enforcing feature consistency across images. To enable this, we train the model with an auxiliary mask prediction head, supervised by a new dataset we collected with pixel-level distractor masks. The resulting VGTW model is a feed-forward network that directly outputs clean, distractor-free point clouds. It requires no additional 3D supervision, remains computationally efficient, and is compatible with existing pipelines. Extensive experiments validate our approach, demonstrating state-of-the-art performance and robust generalization in diverse real-world scenarios.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3687