Abstract: Visual localization is a core technology in fields such as autonomous robotics, virtual/augmented reality, and geo-localization, as a means of estimating location from visual information. Recent image-retrieval approaches scale well to unseen large-scale urban environments but offer only limited accuracy, in contrast to structure-based and matching-based approaches. Despite their precision, matching-based models tend to be over-reliant on the training process and perform poorly on large, sparse datasets. Our study introduces FrustumVL, a novel framework designed to combine the strengths of image-retrieval and matching-based models through a two-stage coarse-to-fine approach. This integration aims to produce a versatile model that operates effectively across urban environments of varying scales. Furthermore, we propose a novel mining strategy based on frustum overlap, called Frustum Miner, to enhance the compatibility of the models incorporated in the framework. Our experiments demonstrate that, on unseen data, the proposed approach yields the lowest localization error on benchmarks such as the Pittsburgh250k-test [18] and Cambridge Landmark [15] datasets when compared against SOTA retrieval-based and matching-based approaches. These results demonstrate that our framework achieves competitive performance when applied to urban environments without additional training.
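The abstract does not spell out how frustum overlap is computed for the Frustum Miner, so the following is only a minimal illustrative sketch, not the paper's method: it approximates each camera frustum as a 2D triangle in the ground plane (camera position, heading, field of view, maximum range, all hypothetical parameters) and estimates the overlap between two frustums by Monte-Carlo IoU, which could then be thresholded to mine positive/negative image pairs.

```python
import numpy as np

def frustum_triangle(cam_xy, heading_rad, fov_rad, max_range):
    """Approximate a camera frustum as a 2D triangle in the ground plane:
    the camera position plus the two far corners of its field of view."""
    left = heading_rad + fov_rad / 2.0
    right = heading_rad - fov_rad / 2.0
    return np.array([
        cam_xy,
        cam_xy + max_range * np.array([np.cos(left), np.sin(left)]),
        cam_xy + max_range * np.array([np.cos(right), np.sin(right)]),
    ])

def _in_triangle(pts, tri):
    """Barycentric point-in-triangle test for an (N, 2) array of points."""
    a, b, c = tri
    v0, v1 = c - a, b - a
    v2 = pts - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return (v >= 0) & (w >= 0) & (v + w <= 1)

def frustum_iou(tri_a, tri_b, n_samples=50_000, seed=0):
    """Monte-Carlo estimate of the overlap (IoU) between two 2D frustums."""
    rng = np.random.default_rng(seed)
    corners = np.vstack([tri_a, tri_b])
    lo, hi = corners.min(axis=0), corners.max(axis=0)
    pts = rng.uniform(lo, hi, size=(n_samples, 2))
    in_a, in_b = _in_triangle(pts, tri_a), _in_triangle(pts, tri_b)
    union = (in_a | in_b).sum()
    return (in_a & in_b).sum() / union if union else 0.0

# Two nearby cameras looking in similar directions -> high overlap, so the
# pair would be mined as a positive; distant or opposing views -> negative.
query = frustum_triangle(np.array([0.0, 0.0]), heading_rad=0.0,
                         fov_rad=np.deg2rad(60), max_range=30.0)
ref = frustum_triangle(np.array([3.0, 1.0]), heading_rad=np.deg2rad(10),
                       fov_rad=np.deg2rad(60), max_range=30.0)
iou = frustum_iou(query, ref)
label = "positive" if iou > 0.3 else "negative"  # 0.3 threshold is illustrative only
print(f"frustum IoU = {iou:.2f} -> {label} pair")
```

In practice the paper's miner may use full 3D frustums and a different overlap measure; the sketch only conveys the idea of selecting training pairs by how much two cameras' visible regions intersect.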