Keywords: 3D reconstruction
Abstract: Current end-to-end multi-view 3D reconstruction methods achieve impressive results, but are built on a restrictive assumption: the scene is entirely static with dense correspondence.
This reliance on idealized inputs causes even the most advanced methods to fail in real-world settings, where transient distractors and occlusions are present. To address this, we propose \emph{Visual Geometry Transformer in the Wild} (VGTW), an end-to-end framework for robust reconstruction from inconsistent views. At its core, VGTW isolates and suppresses distractor-affected regions while preserving the components that are consistent across views. Specifically, we introduce a distractor-aware training strategy that separates clean features from distractor-contaminated ones in the attention mechanism while enforcing feature consistency across images. To enable this, we train the model with an auxiliary mask prediction head, using supervision from a new dataset we collected with pixel-level distractor masks. The resulting VGTW model is a feed-forward network that directly outputs clean, distractor-free point clouds. Remarkably, it requires no additional 3D supervision, remains computationally efficient, and is compatible with existing pipelines.
Extensive experiments validate our approach, demonstrating state-of-the-art performance and robust generalization in diverse, real-world scenarios.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3687