Abstract: Place recognition is a critical technology for robot navigation and autonomous driving, yet it remains challenging due to inefficient point cloud computation, limited feature representation capability, and poor robustness to long-term environmental change. We propose MVSE-Net, a multi-view feature fusion network that embeds semantic information into feature extraction. MVSE-Net converts point cloud data acquired by LiDAR into global descriptors for retrieval in real time. Projecting a point cloud onto 2D images greatly improves computational efficiency: we project each point cloud into a range-view (RV) image (forward view) and a bird's-eye-view (BEV) image (top view). A semantic segmentation network processes the RV image, and the feature extraction part of the semantic model is connected to a transformer attention module that further refines the features for the place recognition task. The point cloud, annotated with the semantic segmentation results, is then converted into a multi-channel semantic BEV image, which is processed by a group convolutional network. Finally, the features of the two branches are fused into a single global feature representation by post-fusion. Experiments on three publicly available datasets demonstrate that MVSE-Net achieves high recall and strong generalization in LiDAR place recognition, outperforming previous state-of-the-art methods.
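To make the two-branch structure concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: an RV branch refined by transformer self-attention, a grouped-convolution branch over a multi-channel semantic BEV image, and post-fusion into one L2-normalized global descriptor. All layer sizes, image resolutions, the pooling scheme, and fusion by concatenation are illustrative assumptions; the abstract does not specify the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MVSENetSketch(nn.Module):
    """Hypothetical sketch of a two-branch RV + semantic-BEV descriptor network.

    Not the authors' implementation: backbones, dimensions, and the fusion
    operator are placeholders chosen only to illustrate the data flow.
    """

    def __init__(self, bev_channels: int = 8, feat_dim: int = 256):
        super().__init__()
        # RV branch: a small CNN stands in for the semantic model's
        # feature extraction backbone.
        self.rv_backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer attention module refining RV features (details assumed).
        self.attn = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )
        # BEV branch: grouped convolutions over the multi-channel
        # semantic BEV image (one channel per semantic class, assumed).
        self.bev_backbone = nn.Sequential(
            nn.Conv2d(bev_channels, 64, 3, stride=2, padding=1, groups=4), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1, groups=4), nn.ReLU(),
        )

    def forward(self, rv: torch.Tensor, bev: torch.Tensor) -> torch.Tensor:
        # RV branch: extract features, refine with self-attention, pool.
        f_rv = self.rv_backbone(rv)               # (B, C, h, w)
        tokens = f_rv.flatten(2).transpose(1, 2)  # (B, h*w, C) token sequence
        tokens = self.attn(tokens)                # attention-refined features
        d_rv = tokens.mean(dim=1)                 # (B, C) pooled RV descriptor
        # BEV branch: grouped convolutions, then global average pooling.
        f_bev = self.bev_backbone(bev)            # (B, C, h', w')
        d_bev = f_bev.flatten(2).mean(dim=2)      # (B, C) pooled BEV descriptor
        # Post-fusion: concatenate the branch descriptors (assumed operator)
        # and L2-normalize for nearest-neighbor retrieval.
        return F.normalize(torch.cat([d_rv, d_bev], dim=1), dim=1)


# Usage with dummy inputs (resolutions are assumptions):
net = MVSENetSketch()
rv = torch.randn(2, 1, 64, 900)    # range-view projection of the scan
bev = torch.randn(2, 8, 128, 128)  # semantic BEV, one channel per class
desc = net(rv, bev)                # (2, 512) global descriptors
```

In a retrieval setup, descriptors like these would be compared by cosine similarity (equivalently, Euclidean distance after L2 normalization) against a database of previously visited places.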