Injecting Geometric Scene Priors into Vision Transformers for Improved 2D-3D Understanding

Published: 20 Aug 2025, Last Modified: 23 Aug 2025 · SP4V · CC BY 4.0
Keywords: geometric vision, transformer, depth estimation, surface normals, self-attention, multi-task learning, 3D understanding
TL;DR: GeoViT integrates geometric priors into vision transformers via novel tokenization and attention mechanisms, achieving SOTA depth/normal estimation while maintaining efficiency, with 12% RMSE improvement on NYU Depth v2.
Abstract: This paper presents GeoViT, a novel vision transformer architecture that integrates geometric scene priors (depth, surface normals) through three key innovations: (1) geometry-aware tokenization, (2) physically-informed attention mechanisms, and (3) consistency-preserving loss functions. Our method achieves state-of-the-art performance on NYU Depth v2 (12% RMSE improvement) and ScanNet (15% normal estimation error reduction) while maintaining computational efficiency (22.4 FPS). The proposed adaptive parameter scheduling enables stable training with a 94% success rate, outperforming existing approaches that either ignore geometric constraints or apply them rigidly. Experiments demonstrate significant advantages in both accuracy and generalization, particularly for textureless regions and complex indoor scenes where pure data-driven methods fail.
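The abstract's first innovation, geometry-aware tokenization, can be illustrated with a minimal sketch: RGB, depth, and surface-normal maps are patchified jointly and projected into a shared token space, so each transformer token carries geometric priors alongside appearance. The function name, channel layout, and random projection below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def geometry_aware_tokenize(rgb, depth, normals, patch=16, dim=64, seed=0):
    """Hypothetical sketch of geometry-aware tokenization.

    rgb: (H, W, 3), depth: (H, W, 1), normals: (H, W, 3).
    Channels are concatenated, split into non-overlapping patches, and
    linearly projected (a random projection stands in for learned weights).
    """
    x = np.concatenate([rgb, depth, normals], axis=-1)  # (H, W, 7)
    H, W, C = x.shape
    assert H % patch == 0 and W % patch == 0
    # Reshape into non-overlapping patches and flatten each to a vector.
    patches = (x.reshape(H // patch, patch, W // patch, patch, C)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, patch * patch * C))        # (N, patch*patch*C)
    # Shared linear projection to the token dimension (random for the demo).
    proj = np.random.default_rng(seed).normal(size=(patches.shape[1], dim))
    return patches @ proj                                # (N, dim) tokens

# Example: a 64x64 frame with 16x16 patches yields (64/16)^2 = 16 tokens.
tokens = geometry_aware_tokenize(
    np.zeros((64, 64, 3)), np.zeros((64, 64, 1)), np.zeros((64, 64, 3)))
print(tokens.shape)  # (16, 64)
```

In a real ViT the projection would be a learned linear layer and the geometric channels might come from an off-the-shelf depth/normal estimator; the point is only that priors enter at the tokenization stage rather than being imposed after attention.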
Submission Number: 25