Abstract: Visual Place Recognition (VPR) on natural images is challenging due to illumination variations and seasonal changes. For long-term localization, emerging event cameras are naturally resilient to such appearance changes. In this paper, we propose a novel multi-modal network, i.e., VEFNet, for VPR that learns location-specific cross RGB-event modality feature representations. Specifically, we first extract dense visual features from the RGB and event frames separately via a shared Convolutional Neural Network (CNN) backbone. The two branches of features are then fed to a cross-modality attention module to establish correspondences between the two modalities. We also employ a self-attention module to enhance contextual integration within the densely encoded features. Finally, the learned global descriptor serves as the place representation of the dual-modality inputs for VPR. Experimental results demonstrate state-of-the-art (SOTA) performance on public datasets.
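To make the described pipeline concrete, the following is a minimal PyTorch sketch of the shared-backbone, cross-attention, self-attention, and global-descriptor stages. All module names, dimensions, the toy two-layer backbone, the use of `nn.MultiheadAttention`, and the mean-pooling aggregation are assumptions for illustration; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VEFNetSketch(nn.Module):
    """Hypothetical sketch of the VEFNet pipeline from the abstract.

    Assumes event data is rendered as 3-channel frames so the same
    backbone weights can process both modalities.
    """

    def __init__(self, feat_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Shared CNN backbone (toy stand-in) extracts dense features
        # from RGB and event frames with the same weights.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Cross-modality attention: RGB features query event features
        # to establish correspondences between the two branches.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Self-attention enhances contextual integration within the
        # fused dense features.
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, rgb: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
        # Each modality passes through the *shared* backbone separately;
        # flatten spatial dims to token sequences of shape (B, N, C).
        f_rgb = self.backbone(rgb).flatten(2).transpose(1, 2)
        f_evt = self.backbone(event).flatten(2).transpose(1, 2)
        # Cross-modal fusion: query = RGB tokens, key/value = event tokens.
        fused, _ = self.cross_attn(f_rgb, f_evt, f_evt)
        # Contextual refinement over the fused tokens.
        ctx, _ = self.self_attn(fused, fused, fused)
        # Aggregate to one global place descriptor (mean pooling here;
        # a learned aggregator such as NetVLAD could be swapped in).
        desc = ctx.mean(dim=1)
        return F.normalize(desc, dim=-1)

# Usage: retrieval-style matching of L2-normalized descriptors.
model = VEFNetSketch()
rgb = torch.randn(2, 3, 128, 128)
event = torch.randn(2, 3, 128, 128)
descriptors = model(rgb, event)  # (2, 512), unit-norm
```

Because the descriptors are L2-normalized, place matching reduces to a cosine-similarity nearest-neighbor search against a database of reference descriptors.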