MFap: Multi-view Feature Map for Visual Localization

Published: 01 Jan 2024 · Last Modified: 12 Apr 2025 · ICIRA (2) 2024 · CC BY-SA 4.0
Abstract: Local feature matching lies at the core of traditional visual localization. However, it relies on the similarity of fine details and structures in local image content. In practice, the difference between camera poses is often so large that the overlapping regions between images nearly vanish, leaving too few similar local features to match; ground-to-aerial view matching is a typical example. In such cases, localization based on local feature matching often fails to achieve precise positioning. Images from different viewpoints have distinct characteristics: ground views usually offer higher resolution and clearer details but a smaller field of view, capturing more specific content, while aerial views have lower resolution but contain rich semantic and scene-structure information. These viewpoint-specific characteristics are complementary. For scenarios with significant viewpoint differences, this paper therefore proposes a ground-to-aerial multi-view fusion method built on existing scene regression techniques. The method leverages known camera poses to fuse multi-view images into a Bird's Eye View (BEV) feature map containing rich scene information, providing a global top-down perspective for understanding and analyzing objects and structures in the environment. Matching is performed in the BEV feature space, and the pose of the query image is obtained by direct regression, yielding precise localization. Extensive experiments compare the method with current local feature matching approaches; the results show that in scenarios with large viewpoint differences, it surpasses traditional localization methods, achieving precise and robust localization.
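
The page provides no implementation, but the pose-based fusion step the abstract describes can be pictured with a small sketch: per-view image features are splatted onto a ground-plane (BEV) grid via the plane-induced homography of each known camera pose, then averaged across views where visible. Everything below is an illustrative assumption rather than the authors' method: the function names (ground_plane_homography, splat_to_bev, fuse_views), the flat Z=0 ground-plane model, nearest-neighbour sampling, and mean fusion are all hypothetical choices.

```python
import numpy as np

def ground_plane_homography(K, R, t):
    """Homography mapping ground-plane points (X, Y, 1) with Z = 0 to
    homogeneous pixels, given intrinsics K and world-to-camera pose (R, t).
    For Z = 0: x_cam = R @ [X, Y, 0] + t = [r1 r2 t] @ [X, Y, 1]."""
    return K @ np.column_stack([R[:, 0], R[:, 1], t])

def splat_to_bev(feat, K, R, t, grid_xy, eps=1e-6):
    """Fill a BEV grid with features from one view by projecting each
    ground-plane cell into the image (nearest-neighbour lookup; an
    assumed sampling scheme, not necessarily the paper's).

    feat:    (H, W, C) per-view feature map
    grid_xy: (Gh, Gw, 2) metric ground-plane coordinates of BEV cells
    returns: (Gh, Gw, C) BEV features and a (Gh, Gw) visibility mask
    """
    Gh, Gw, _ = grid_xy.shape
    H_img, W_img, C = feat.shape
    Hg = ground_plane_homography(K, R, t)
    pts = np.concatenate([grid_xy.reshape(-1, 2),
                          np.ones((Gh * Gw, 1))], axis=1)      # (N, 3)
    proj = (Hg @ pts.T).T                                       # (N, 3)
    z = proj[:, 2:3]
    uv = proj[:, :2] / np.where(np.abs(z) < eps, eps, z)        # pixel coords
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    # A cell is visible if it lands in front of the camera and inside the image.
    valid = (z[:, 0] > 0) & (u >= 0) & (u < W_img) & (v >= 0) & (v < H_img)
    bev = np.zeros((Gh * Gw, C), dtype=np.float64)
    bev[valid] = feat[v[valid], u[valid]]
    return bev.reshape(Gh, Gw, C), valid.reshape(Gh, Gw)

def fuse_views(views, grid_xy):
    """Fuse several posed views into one BEV map by averaging each cell's
    features over the views that observe it (assumed mean fusion)."""
    acc, cnt = None, None
    for feat, K, R, t in views:
        bev, mask = splat_to_bev(feat, K, R, t, grid_xy)
        if acc is None:
            acc = np.zeros_like(bev)
            cnt = np.zeros(mask.shape, dtype=np.int32)
        acc += bev * mask[..., None]
        cnt += mask
    return acc / np.maximum(cnt, 1)[..., None]

if __name__ == "__main__":
    # Toy demo with random "features" and a single synthetic pose.
    rng = np.random.default_rng(0)
    K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
    R, t = np.eye(3), np.array([0., 0., 5.])   # plane 5 m along the optical axis
    xs = np.linspace(-2., 2., 64)
    grid_xy = np.stack(np.meshgrid(xs, xs), axis=-1)
    feat = rng.standard_normal((480, 640, 8))
    bev = fuse_views([(feat, K, R, t)], grid_xy)
    print(bev.shape)                            # (64, 64, 8)
```

Under this reading, the resulting BEV map is the shared feature space in which ground and aerial observations become comparable; the abstract's final step, regressing the query pose from matches in that space, would sit on top of such a representation.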