Abstract: Pose estimation under sparse views ($\leq 10$) has recently seen significant advances driven by deep learning. Most existing methods directly regress absolute poses and achieve leading performance on benchmarks. However, directly regressing scaled poses with deep neural networks is inherently ill-posed, yielding overfitted models that generalize poorly to diverse scenarios. We instead draw on the well-posed solutions of traditional Structure-from-Motion (SfM) pipelines and propose LoopRefine, a diffusion model that, assuming known camera intrinsics, estimates pairwise normalized relative camera poses and exploits triplet coplanar constraints to align their scales. Like traditional SfM methods, LoopRefine incrementally constructs camera triplets, resolving scale ambiguities by gradually recovering the scale of poses and connecting the pose graph. To further improve pose estimation accuracy at inference, we exploit pose compatibility by randomly chaining loop transformations on the pose graph and performing iterative loop-consistency-based optimization. Extensive experiments demonstrate the superiority of our method, and its generalization to both object-centered and scene datasets confirms the effectiveness of the integrated geometric constraints.
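The loop-consistency idea the abstract relies on can be sketched briefly: composing the relative pose transformations around any closed cycle of the pose graph should yield the identity, so the deviation from identity measures the inconsistency to be minimized. The sketch below assumes relative poses are represented as 4x4 SE(3) matrices; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def make_pose(R, t):
    """Assemble a 4x4 SE(3) matrix from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(theta):
    """Rotation about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def loop_consistency_error(relative_poses):
    """Chain relative poses around a closed loop on the pose graph.

    For a perfectly consistent loop the composition equals the identity,
    so the Frobenius norm of the residual quantifies the loop error
    (hypothetical metric; the paper's optimization objective may differ).
    """
    T = np.eye(4)
    for T_ij in relative_poses:
        T = T @ T_ij
    return np.linalg.norm(T - np.eye(4))

# A triangle loop of three 120-degree rotations composes to the identity,
# so its loop error is (numerically) zero; perturbing one translation
# breaks consistency and the error becomes nonzero.
consistent = [make_pose(rot_z(2 * np.pi / 3), np.zeros(3))] * 3
perturbed = consistent[:2] + [make_pose(rot_z(2 * np.pi / 3), np.array([0.1, 0.0, 0.0]))]
```

An iterative optimizer would repeatedly sample such loops and adjust the pairwise poses (including their recovered scales) to drive these residuals toward zero.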
External IDs: dblp:journals/ral/WangDSXLLD25