Multimodal 3D Object Detection Based on Sparse Interaction in Internet of Vehicles

Published: 01 Jan 2025, Last Modified: 06 Nov 2025, IEEE Trans. Veh. Technol. 2025, CC BY-SA 4.0
Abstract: Combining the Internet of Vehicles (IoV) with the visual perception of autonomous driving can enhance vehicle intelligence. Vehicles use a 3D object detection algorithm to perceive their surroundings and share the detection results with other vehicles through IoV technology, improving the efficiency of intelligent transportation systems. Multimodal fusion of LiDAR and camera information can improve 3D object detection performance. However, because information from the two modalities is heterogeneous, multimodal 3D object detection still faces challenges such as the difficult semantic alignment of modal elements and inadequate fusion. To mitigate these challenges, we first propose sparse interaction with centroid query (SICQ) for voxel-level features from the different modalities, which aligns their semantic information through more precise, fine-grained interaction. We then propose dense fusion with multi-scale masked attention (DFMMA), which applies multi-scale feature masks to bird's-eye-view (BEV)-level multimodal features to improve the perception of small-object features. Finally, we propose the multimodal grid encoder with positional information (MGEPI), which implicitly guides grid-level features with positional information and applies a transformer-based attention mechanism, improving the perception of spatial context in the detection scene and enhancing the robustness of the algorithm. Comprehensive experiments on the popular KITTI dataset demonstrate that our algorithm achieves superior 3D object detection performance.
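To make the masked-attention fusion idea concrete, below is a minimal PyTorch sketch of cross-attention from camera BEV features to LiDAR BEV features restricted by a binary foreground mask, shown at a single scale. This is an illustration only: the class name MaskedBEVAttention, the mask semantics, and the single-scale simplification are assumptions, not the paper's DFMMA implementation, whose details are not given in the abstract.

```python
# Illustrative sketch (not the paper's DFMMA): masked cross-attention
# between camera and LiDAR BEV feature maps at one scale. The paper's
# multi-scale variant would presumably repeat this at several BEV
# resolutions and merge the results.
import torch
import torch.nn as nn

class MaskedBEVAttention(nn.Module):
    """Camera BEV cells attend to LiDAR BEV cells, but only at positions
    marked as foreground by a binary mask."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cam_bev, lidar_bev, fg_mask):
        # cam_bev, lidar_bev: (B, C, H, W); fg_mask: (B, H, W) bool,
        # True where LiDAR BEV cells may be attended to.
        B, C, H, W = cam_bev.shape
        q = cam_bev.flatten(2).transpose(1, 2)     # (B, H*W, C) queries
        kv = lidar_bev.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values
        # key_padding_mask is True at positions to be IGNORED.
        ignore = ~fg_mask.flatten(1)               # (B, H*W)
        fused, _ = self.attn(q, kv, kv, key_padding_mask=ignore)
        return fused.transpose(1, 2).reshape(B, C, H, W)

if __name__ == "__main__":
    B, C, H, W = 2, 64, 32, 32
    fusion = MaskedBEVAttention(dim=C)
    cam = torch.randn(B, C, H, W)
    lidar = torch.randn(B, C, H, W)
    mask = torch.rand(B, H, W) > 0.5  # stand-in foreground mask
    out = fusion(cam, lidar, mask)
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Restricting the keys to masked (foreground) cells is what keeps attention focused on regions likely to contain objects, which is one plausible way a mask can help small-object perception as the abstract claims.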