Abstract: Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image. Among these, sparse feature matching stands out as an efficient, versatile, and generally lightweight approach with numerous applications. However, feature-based methods often struggle with significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimates. To overcome these limitations, we propose a novel approach that leverages a globally sparse but locally dense 3D representation of 2D features. By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render image patch descriptors observed during tracking. Given an initial pose estimate, we first synthesize descriptors from the voxels using volumetric rendering and then perform feature matching to estimate the camera pose. This method enables the generation of descriptors for unseen views, enhancing robustness to viewpoint changes. We evaluate our method on the 7-Scenes and Cambridge Landmarks datasets. Our results show that our approach significantly outperforms existing state-of-the-art feature representation techniques in indoor environments, achieving up to a 39% improvement in median translation error. Additionally, our approach yields comparable results to other methods for outdoor scenes but with lower computational and memory footprints.
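To make the descriptor synthesis step concrete, the following is a minimal sketch of volumetric rendering along a single camera ray. It assumes the standard emission-absorption compositing used in neural radiance fields, which the abstract does not confirm as the exact formulation. The function name `render_descriptor` and its inputs (per-sample densities, descriptors, and sample spacings) are hypothetical names chosen for illustration only.

```python
import numpy as np

def render_descriptor(sigmas, descriptors, deltas):
    """Composite per-sample descriptors along one ray (hypothetical helper).

    sigmas:      (N,) non-negative densities at the N samples along the ray
    descriptors: (N, D) descriptor vectors stored at those samples
    deltas:      (N,) distances between consecutive samples
    Assumes standard emission-absorption compositing; not taken from the paper.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    descriptors = np.asarray(descriptors, dtype=float)
    deltas = np.asarray(deltas, dtype=float)

    # Opacity contributed by each sample.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    # Per-sample compositing weights.
    weights = trans * alphas
    # Rendered descriptor is the weighted sum of the sample descriptors.
    return weights @ descriptors, weights

# Example: an empty sample, then an opaque one that hides the sample behind it.
desc, weights = render_descriptor(
    sigmas=[0.0, 1e9, 5.0],
    descriptors=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    deltas=[0.1, 0.1, 0.1],
)
```

In this sketch the rendered descriptor is dominated by the first opaque sample along the ray, which is why descriptors can be synthesized for a viewpoint that was never observed: only the pose-dependent ray changes, not the stored voxel content.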
External IDs: dblp:conf/wacv/PolizziC0K25