Keywords: place recognition, LiDAR, cross-modality, vision
TL;DR: Our novel camera-LiDAR place recognition method, VXP, enforces local feature similarities in a self-supervised manner and effectively brings global context from images and LiDAR scans into a shared latent space, enabling accurate cross-modal retrieval.
Abstract: Cross-modal place recognition methods are flexible GPS alternatives under varying environmental conditions and sensor setups. However, this task is non-trivial, since extracting consistent and robust global descriptors from different modalities is challenging. To tackle this issue, we propose \textit{Voxel-Cross-Pixel (VXP)}, a novel camera-to-LiDAR place recognition framework that enforces local similarities in a self-supervised manner and effectively brings global context from images and LiDAR scans into a shared feature space. Specifically, VXP is trained in three stages: first, we deploy a visual transformer to compactly represent input images; second, we establish local correspondences between image-based and point-cloud-based feature spaces using our novel geometric alignment module; third, we aggregate local similarities into an expressive shared latent space. Extensive experiments on three benchmarks (Oxford RobotCar, ViViD++, and KITTI) demonstrate that our method surpasses state-of-the-art cross-modal retrieval methods by a large margin. Our evaluations show that the proposed method is accurate, efficient, and lightweight. Our project page is available at: \href{https://yunjinli.github.io/projects-vxp/}{https://yunjinli.github.io/projects-vxp/}.
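To make the cross-modal retrieval setup described in the abstract concrete, the following is a minimal, illustrative sketch and not the authors' VXP implementation: two toy encoders (hypothetical `ToyImageEncoder` and `ToyLidarEncoder`, with an assumed 256-dimensional descriptor) map an image and LiDAR submaps into a shared descriptor space, and a place is retrieved by nearest-neighbor search under cosine similarity.

```python
# Illustrative sketch of camera-to-LiDAR retrieval in a shared descriptor space.
# All module names, dimensions, and losses here are assumptions for exposition,
# not the VXP architecture or training procedure.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    """Stand-in for the image branch: image -> L2-normalized global descriptor."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, img):
        feat = self.backbone(img).flatten(1)
        return F.normalize(self.proj(feat), dim=-1)

class ToyLidarEncoder(nn.Module):
    """Stand-in for the LiDAR branch: point cloud -> L2-normalized global descriptor."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128), nn.ReLU())
        self.proj = nn.Linear(128, dim)

    def forward(self, points):  # points: (B, N, 3)
        feat = self.mlp(points).max(dim=1).values  # permutation-invariant pooling
        return F.normalize(self.proj(feat), dim=-1)

# Retrieval: compare a query image descriptor against LiDAR submap descriptors
# using cosine similarity (descriptors are unit-normalized, so a dot product suffices).
img_enc, lidar_enc = ToyImageEncoder(), ToyLidarEncoder()
query = img_enc(torch.randn(1, 3, 128, 128))        # one query image
database = lidar_enc(torch.randn(100, 4096, 3))     # 100 LiDAR submaps in the map
scores = query @ database.t()                       # (1, 100) similarity scores
best_match = scores.argmax(dim=-1)                  # index of the retrieved place
print(best_match.item())
```

In practice, the key design choice highlighted by the abstract is that both branches are trained so their descriptors live in the same latent space, first by aligning local features across modalities and then by aggregating them into a global descriptor; the toy encoders above only illustrate the retrieval interface, not that training scheme.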
Supplementary Material: pdf
Submission Number: 84