Abstract: We propose a visual re-localization pipeline that achieves high-precision localization in large-scale, street-level environments. Our method establishes a coarse-to-fine cross-modal alignment network between 2D image features and 3D implicit features retrieved from NeRF. By sampling 3D features directly from NeRF, we bypass the traditional rendering process, which significantly reduces inference time. Furthermore, we design a novel loss function for static street-scene reconstruction that mitigates interference from dynamic objects such as vehicles and pedestrians. We conduct extensive experiments on the KITTI dataset and compare our method against several open-source algorithms. The results demonstrate the effectiveness and robustness of the proposed method for real-world urban re-localization. We will release the code upon publication.
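The abstract does not give the form of the loss for static street-scene reconstruction. As an illustration only, one common way to keep dynamic objects from influencing reconstruction is to exclude pixels flagged as dynamic from a photometric error. The sketch below assumes a per-pixel boolean mask is available (for example from an off-the-shelf segmentation model); the function name `masked_photometric_loss` and its arguments are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

Pixel = Tuple[float, float, float]  # RGB values for one pixel


def masked_photometric_loss(
    rendered: Sequence[Pixel],
    target: Sequence[Pixel],
    is_dynamic: Sequence[bool],
) -> float:
    """Mean squared RGB error over pixels NOT flagged as dynamic.

    Assumption: `is_dynamic[i]` is True where pixel i belongs to a moving
    object (vehicle, pedestrian). This is an illustrative stand-in, not
    the loss proposed in the paper.
    """
    if not (len(rendered) == len(target) == len(is_dynamic)):
        raise ValueError("all inputs must have the same number of pixels")

    total = 0.0
    n_static = 0
    for r, t, dyn in zip(rendered, target, is_dynamic):
        if dyn:
            continue  # dynamic pixels contribute nothing to the loss
        total += sum((a - b) ** 2 for a, b in zip(r, t))
        n_static += 1

    # If every pixel is dynamic there is nothing to supervise.
    return total / n_static if n_static else 0.0
```

With this form, a mismatch that falls entirely inside a masked region leaves the loss unchanged, so a passing car in the training image does not pull the static scene toward it.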
External IDs:dblp:conf/rcar/ShaLWHWX25