Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection

Published: 10 Apr 2024 · Last Modified: 05 Mar 2025 · https://ieeexplore.ieee.org/abstract/document/10497135/ · CC BY 4.0
Abstract: LiDAR and camera fusion techniques are promising for achieving 3D object detection in autonomous driving. Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds to enhance detection accuracy. Nevertheless, the restricted resolution of 2D feature maps impedes accurate re-projection and often induces a pronounced boundary-blurring effect, primarily attributable to erroneous semantic segmentation. To address these limitations, we present the Multi-Sem Fusion (MSF) framework, a versatile multi-modal fusion approach that employs 2D and 3D semantic segmentation methods to generate parsing results for both modalities. The 2D semantic information is then re-projected into the 3D point clouds using calibration parameters. To tackle misalignment between the 2D and 3D parsing results, we introduce an Adaptive Attention-based Fusion (AAF) module that fuses them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then passed to the subsequent 3D object detector. Furthermore, we propose a Deep Feature Fusion (DFF) module to aggregate deep features at different levels and boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks by comparison with different baselines. The experimental results show that the proposed fusion strategies can significantly improve detection performance compared with methods using only point clouds or only 2D semantic information. Moreover, our approach integrates seamlessly as a plug-in into any detection framework.
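
To make the fusion step more concrete, below is a minimal sketch of how an adaptive attention-based fusion of re-projected 2D and native 3D per-point semantic scores could look. This is not the authors' implementation: the module name `AdaptiveAttentionFusion`, the small MLP, and the convex-combination weighting are illustrative assumptions based only on the abstract's description of learning an adaptive per-point fusion score.

```python
import torch
import torch.nn as nn


class AdaptiveAttentionFusion(nn.Module):
    """Hypothetical AAF-style module: for each point, learn a fusion score
    that weights the re-projected 2D semantic scores against the 3D ones."""

    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        # Per-point MLP mapping the concatenated 2D and 3D class scores
        # to a scalar fusion weight in (0, 1).
        self.score_net = nn.Sequential(
            nn.Linear(2 * num_classes, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, sem2d: torch.Tensor, sem3d: torch.Tensor) -> torch.Tensor:
        # sem2d, sem3d: (N, num_classes) per-point class scores from the
        # re-projected 2D parsing and the 3D parsing, respectively.
        w = self.score_net(torch.cat([sem2d, sem3d], dim=-1))  # (N, 1)
        # Convex combination: w near 1 trusts the 2D branch, w near 0
        # trusts the 3D branch.
        return w * sem2d + (1.0 - w) * sem3d


if __name__ == "__main__":
    aaf = AdaptiveAttentionFusion(num_classes=10)
    pts_2d = torch.rand(1024, 10)   # re-projected 2D semantic scores
    pts_3d = torch.rand(1024, 10)   # 3D semantic scores
    fused = aaf(pts_2d, pts_3d)     # (1024, 10) fused semantic scores
    print(fused.shape)
```

In this sketch, the sigmoid output plays the role of the learned fusion score: the network can locally down-weight the re-projected 2D labels near object boundaries, where the boundary-blurring effect makes them unreliable, and rely more on the 3D parsing instead.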