Abstract: Severe appearance changes pose a pervasive and intricate challenge for Visual Place Recognition (VPR), and the current best solutions adopt a composite strategy of global retrieval followed by reranking. However, these reranking techniques require elaborate procedures to extract and match local features, leading to a notable increase in computational cost and inference time. To this end, we propose a novel framework that unifies global and local features within a single pipeline network, offering a simple solution that operates seamlessly across diverse scenarios without additional complex modules. Specifically, our approach trains discriminative global features via image classification, while extracting effective local features directly from intermediate layers without extra operations. To enhance the expressiveness of the features, we fuse multi-layer Convolutional Neural Network (CNN) feature maps to combine diverse semantic information. In addition, a Transformer with relative position encoding captures cross-layer long-range and positional correlations. Together with the associated attention values, low-resolution feature maps reduce the number of features involved in matching, lowering computational overhead and markedly accelerating reranking. Extensive experiments show that our model achieves state-of-the-art (SOTA) performance on datasets with severe appearance changes, with the fastest inference time and the lowest memory usage.
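The abstract outlines a single-pipeline design that fuses multi-layer CNN feature maps and applies a Transformer over cross-layer tokens to produce both a global descriptor and local features. The sketch below is only a minimal illustration of that idea, assuming details the abstract does not specify: a ResNet-50 backbone, the choice of intermediate stages, the projection dimension, and a standard Transformer encoder standing in for the paper's relative position encoding. It is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class UnifiedVPRSketch(nn.Module):
    """Conceptual sketch: one forward pass yields a global descriptor for
    retrieval and local features (from intermediate layers) for reranking."""

    def __init__(self, dim=256, num_heads=8, num_layers=2):
        super().__init__()
        backbone = models.resnet50(weights=None)  # assumed backbone
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4

        # Project multi-layer feature maps to a shared dimension before fusion.
        self.proj3 = nn.Conv2d(1024, dim, kernel_size=1)
        self.proj4 = nn.Conv2d(2048, dim, kernel_size=1)

        # Standard Transformer encoder as a stand-in for the paper's
        # relative-position-encoded attention over cross-layer tokens.
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)

        # Simple average pooling; the paper's aggregation may differ.
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer2(self.layer1(x))
        f3 = self.layer3(x)          # intermediate map -> local features
        f4 = self.layer4(f3)         # deepest map -> semantic context

        t3 = self.proj3(f3).flatten(2).transpose(1, 2)   # (B, H3*W3, dim)
        t4 = self.proj4(f4).flatten(2).transpose(1, 2)   # (B, H4*W4, dim)
        tokens = self.encoder(torch.cat([t3, t4], dim=1))

        global_desc = self.pool(tokens.transpose(1, 2)).squeeze(-1)  # retrieval
        local_feats = tokens[:, : t3.shape[1], :]                    # reranking
        return nn.functional.normalize(global_desc, dim=-1), local_feats
```

In this hypothetical setup, the global descriptor would be trained with a classification-style objective for retrieval, while the token features from the intermediate stage are reused directly for reranking; because they come from a low-resolution map, far fewer features enter the matching step.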
External IDs: doi:10.1109/lra.2024.3376967