MENet: Multi-Modal Mapping Enhancement Network for 3D Object Detection in Autonomous Driving

Published: 01 Jan 2024, Last Modified: 09 Apr 2025. IEEE Trans. Intell. Transp. Syst. 2024. License: CC BY-SA 4.0
Abstract: To achieve more accurate perception, LiDAR and cameras are increasingly used together to improve 3D object detection. However, building an effective fusion mechanism remains a non-trivial task, which hinders the development of multi-modal methods. In particular, the construction of the mapping relationship between the two modalities is far from fully explored. Canonical cross-modal mapping fails when the calibration matrix is inaccurate, and it also discards much of the richness and density of the RGB image information. This paper extends the traditional one-to-one alignment between LiDAR and camera. For every projected point, we enhance the cross-modal mapping relationship by aggregating color-texture-related and shape-contour-related features. Furthermore, a mapping pyramid is proposed to leverage the semantic representations of image features at different stages. These mapping enhancement strategies increase the utilization of image information. Finally, we design an attention-based fusion module that improves point cloud features with auxiliary image features. Extensive experiments on the KITTI and SUN RGB-D datasets show that our model achieves satisfactory 3D object detection compared with other multi-modal fusion networks, especially for categories with sparse point clouds.
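As a rough illustration of the attention-based fusion described in the abstract (not the authors' released code), the following PyTorch sketch shows how per-point LiDAR features could attend to a small set of image features gathered for each projected point, e.g. from several pyramid stages or image neighborhoods. All module names, dimensions, and the residual enhancement are assumptions for the sketch.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Hypothetical cross-attention fusion: point features attend to image features."""

    def __init__(self, point_dim=64, image_dim=256, embed_dim=128):
        super().__init__()
        self.q_proj = nn.Linear(point_dim, embed_dim)
        self.k_proj = nn.Linear(image_dim, embed_dim)
        self.v_proj = nn.Linear(image_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, point_dim)

    def forward(self, point_feats, image_feats):
        # point_feats: (N, point_dim)     features of projected LiDAR points
        # image_feats: (N, K, image_dim)  K image features gathered per point
        q = self.q_proj(point_feats).unsqueeze(1)   # (N, 1, E)
        k = self.k_proj(image_feats)                # (N, K, E)
        v = self.v_proj(image_feats)                # (N, K, E)
        # Scaled dot-product attention over the K candidate image features.
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (N, 1, K)
        fused = (attn @ v).squeeze(1)               # (N, E)
        # Residual enhancement of the point cloud features with image context.
        return point_feats + self.out_proj(fused)


if __name__ == "__main__":
    fusion = AttentionFusion()
    points = torch.randn(1024, 64)       # 1024 projected points
    images = torch.randn(1024, 4, 256)   # 4 image features per point (assumed)
    print(fusion(points, images).shape)  # torch.Size([1024, 64])
```

Gathering several image features per point, rather than a single bilinearly sampled pixel, is one way a one-to-many mapping such as the paper's could tolerate small calibration errors while using more of the image.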