Keywords: SfM, Camera pose estimation, 3D Reconstruction
Abstract: Camera pose estimation is a key step in 3D reconstruction and view-synthesis pipelines. We present a deep, global Structure-from-Motion framework based on learned view-graph aggregation. Our method employs a permutation-equivariant, edge-conditioned graph neural network that takes noisy pairwise relative poses as input and outputs globally consistent camera extrinsics. The network is trained without ground-truth supervision, relying solely on a relative-pose consistency objective. This is followed by 3D point triangulation and robust bundle adjustment. A fast view re-integration step increases camera coverage by reintroducing discarded images. Our approach is efficient, scalable to more than a thousand images, and robust to variations in view-graph density. We evaluate our method on MegaDepth, 1DSfM, Strecha, and BlendedMVS. These experiments demonstrate that our method achieves superior rotation and translation accuracy to deep track-centric methods while registering more images in many scenes, and results competitive with state-of-the-art classical pipelines at a fraction of their runtime.
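To make the self-supervised objective concrete, the sketch below shows one common form of a relative-pose consistency loss on rotations. It is an illustrative assumption, not the authors' implementation: for each view-graph edge (i, j), the predicted absolute rotations are scored against the measured relative rotation via the identity R_ij ≈ R_j R_iᵀ. The function name and data layout are hypothetical.

```python
# Hedged sketch of a relative-pose consistency objective (rotations only).
# Assumption: not the paper's actual code; function/variable names are illustrative.
import numpy as np

def rotation_consistency_loss(R_abs, edges, R_rel):
    """Mean squared Frobenius residual between predicted and measured
    relative rotations over the view graph.

    R_abs : (n, 3, 3) array of predicted absolute camera rotations.
    edges : list of (i, j) view-graph edges.
    R_rel : dict mapping (i, j) -> (3, 3) measured relative rotation.
    """
    loss = 0.0
    for (i, j) in edges:
        pred = R_abs[j] @ R_abs[i].T  # predicted relative rotation for edge (i, j)
        loss += np.linalg.norm(pred - R_rel[(i, j)], ord="fro") ** 2
    return loss / len(edges)
```

Because the loss only compares predicted poses against the input pairwise measurements, no ground-truth camera poses are required, which matches the unsupervised training setup described above.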
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17764