DMFusion: LiDAR-camera fusion framework with depth merging and temporal aggregation

Published: 01 Jan 2024, Last Modified: 13 Nov 2024, Applied Intelligence 2024, CC BY-SA 4.0
Abstract: Multimodal 3D object detection is an active research topic in the field of autonomous driving. Most existing methods utilize both camera and LiDAR modalities but fuse their features through simple and insufficient mechanisms. Additionally, these approaches lack reliable positional and temporal information because they rely on single-frame camera data. In this paper, we propose a novel end-to-end framework for 3D object detection that addresses these problems through spatial and temporal fusion. The spatial information of bird's-eye view (BEV) features is enhanced by integrating depth features from point clouds during the conversion of image features into 3D space, and positional and temporal information is augmented by aggregating multi-frame features. The framework, named DMFusion, consists of the following components: (i) a novel depth fusion view transform module (DFLSS), (ii) a simple and easily adjustable temporal fusion module based on 3D convolution (3DMTF), and (iii) a LiDAR-temporal fusion module based on a channel attention mechanism. On the nuScenes benchmark, DMFusion improves mAP by 1.42% and NDS by 1.26% over the baseline model, which demonstrates the effectiveness of our proposed method. The code will be released at https://github.com/lilkeker/DMFusion.
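The abstract does not detail the LiDAR-temporal fusion module, so the following is only a minimal sketch of one plausible reading: a squeeze-and-excitation-style channel attention gate applied to concatenated LiDAR BEV features and temporally aggregated camera BEV features. The class name, channel sizes, and reduction ratio are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class ChannelAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse LiDAR BEV features with temporally
    aggregated camera BEV features via channel attention."""

    def __init__(self, lidar_channels: int, camera_channels: int, reduction: int = 8):
        super().__init__()
        fused = lidar_channels + camera_channels
        # Squeeze-and-excitation style gating over the concatenated channels.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # squeeze: global spatial average
            nn.Conv2d(fused, fused // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, kernel_size=1),
            nn.Sigmoid(),                                 # per-channel weights in (0, 1)
        )
        # Project the re-weighted features back to the detector's channel width.
        self.out_conv = nn.Conv2d(fused, lidar_channels, kernel_size=3, padding=1)

    def forward(self, lidar_bev: torch.Tensor, camera_bev: torch.Tensor) -> torch.Tensor:
        x = torch.cat([lidar_bev, camera_bev], dim=1)     # (B, C_l + C_c, H, W)
        x = x * self.attention(x)                         # re-weight channels
        return self.out_conv(x)


# Example usage with illustrative BEV sizes (nuScenes-style 180x180 grid).
fusion = ChannelAttentionFusion(lidar_channels=256, camera_channels=80)
lidar_bev = torch.randn(2, 256, 180, 180)
camera_bev = torch.randn(2, 80, 180, 180)
fused_bev = fusion(lidar_bev, camera_bev)                 # (2, 256, 180, 180)
```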