ES-GGT: Efficient Submap-based Visual Geometry Grounded Transformer with Spatial Memory Alignment

12 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | CC BY 4.0
Keywords: Scene Reconstruction, Stream Reconstruction, Multi-view Stereo
TL;DR: ES-GGT, an efficient method for streaming scene reconstruction
Abstract: Foundation models have recently emerged as powerful tools in 3D vision, greatly advancing the field of 3D perception. However, improving computational efficiency while maintaining consistency over long sequences remains a key challenge. We present ES-GGT, an efficient method for streaming scene reconstruction built on VGGT, a state-of-the-art feed-forward visual geometry model. We align submaps in a streaming manner using a hierarchical, local-to-global strategy. For local submaps, we perform fine-grained alignment of their scales and coordinate systems by streaming low-level information, which reduces computational complexity while keeping memory cost and performance comparable to feeding all submaps to the model simultaneously. For global submaps, we integrate high-level spatial memory with a tri-perspective view (TPV) representation, which extends the bird’s-eye view (BEV) with two orthogonal planes. We then generate a 15-degrees-of-freedom homography transformation matrix to achieve global alignment. ES-GGT significantly improves inference speed and efficiently handles long input sequences.
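Illustrative note: the 15 degrees of freedom mentioned in the abstract match a general 3D projective transform, i.e. a 4x4 matrix with 16 entries defined only up to an overall scale. The sketch below is a minimal illustration under that assumption, not the authors' implementation. The function names apply_homography_3d and similarity_as_homography are hypothetical, and the similarity transform (scale, rotation, translation) is shown only as a familiar special case of such a matrix.

```python
def apply_homography_3d(H, points):
    """Apply a 4x4 projective transform H (nested lists) to 3D points.

    Each point is lifted to homogeneous coordinates (x, y, z, 1),
    multiplied by H, and divided by the resulting fourth coordinate.
    """
    out = []
    for x, y, z in points:
        p = (x, y, z, 1.0)
        q = [sum(H[r][c] * p[c] for c in range(4)) for r in range(4)]
        w = q[3]
        out.append((q[0] / w, q[1] / w, q[2] / w))
    return out


def similarity_as_homography(scale, R, t):
    """Embed a similarity transform (scale, 3x3 rotation R, translation t)
    in a 4x4 matrix. This 7-DoF case is a subset of the 15-DoF family."""
    H = [[scale * R[r][c] for c in range(3)] + [t[r]] for r in range(3)]
    H.append([0.0, 0.0, 0.0, 1.0])
    return H


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H = similarity_as_homography(2.0, identity, (1.0, 0.0, 0.0))

# Multiplying every entry of H by a nonzero constant leaves the mapping
# unchanged, which is why 16 entries give only 15 degrees of freedom.
H_scaled = [[5.0 * v for v in row] for row in H]

aligned = apply_homography_3d(H, [(1.0, 1.0, 1.0)])
aligned_scaled = apply_homography_3d(H_scaled, [(1.0, 1.0, 1.0)])
```

In this example both aligned and aligned_scaled contain the single point (3.0, 2.0, 2.0), so the overall scale of the matrix carries no information.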
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4311