CIDRA-Net: Cross-modal interaction fusion network with distribution-relation awareness for robust 3D object detection

Published: 01 Jan 2025, Last Modified: 05 Nov 2025 · Neural Networks 2025 · CC BY-SA 4.0
Abstract: 3D object detection is crucial for autonomous driving, enabling accurate object classification and localization in the real world. Existing methods typically rely on basic element-wise operations to fuse multi-modal features from point clouds and images, which limits the effective learning of camera semantics and LiDAR spatial information. Moreover, the inherent sparsity of point clouds causes distribution imbalances in receptive fields, and the complexity of 3D objects conceals implicit relational contexts. To address these limitations, we propose CIDRA-Net, a cross-modal interaction fusion network with distribution-relation awareness. First, we introduce a region cross-modal interaction fusion (RCIF) module that combines LiDAR features with camera depth information through dual-modal attention. We then employ a dual-branch distribution perception (DBDP) module that separates and enhances two distribution-level features to learn point distributions. Finally, a global-local relation mining (GLRM) strategy captures both local and global contextual information for better object understanding and refined regression. Our approach achieves state-of-the-art performance on the nuScenes and KITTI benchmarks while demonstrating strong generalization across backbones and robustness against sensor errors.
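To make the cross-modal fusion idea concrete, the sketch below shows one plausible way to let LiDAR and camera features attend to each other before fusing them. It is a minimal illustration only: the class name `CrossModalAttentionFusion`, the tensor shapes, and the use of standard multi-head attention are assumptions for exposition, not the paper's actual RCIF module.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Illustrative dual-modal attention fusion between LiDAR and camera
    tokens (a sketch under assumed shapes, not the RCIF implementation)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # LiDAR queries attend to camera features, and vice versa.
        self.lidar_to_cam = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cam_to_lidar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, lidar_feat: torch.Tensor, cam_feat: torch.Tensor) -> torch.Tensor:
        # lidar_feat: (B, N_lidar, C) flattened LiDAR/BEV tokens
        # cam_feat:   (B, N_cam, C)   flattened camera tokens
        lidar_enh, _ = self.lidar_to_cam(lidar_feat, cam_feat, cam_feat)
        cam_enh, _ = self.cam_to_lidar(cam_feat, lidar_feat, lidar_feat)
        # Pool the camera-enhanced tokens into a global context vector and
        # broadcast it back onto the LiDAR tokens before fusing.
        cam_ctx = cam_enh.mean(dim=1, keepdim=True).expand_as(lidar_enh)
        return self.fuse(torch.cat([lidar_enh, cam_ctx], dim=-1))


if __name__ == "__main__":
    fusion = CrossModalAttentionFusion()
    lidar = torch.randn(2, 1024, 256)    # e.g., a flattened BEV grid
    cam = torch.randn(2, 6 * 100, 256)   # e.g., tokens from 6 surround cameras
    print(fusion(lidar, cam).shape)      # torch.Size([2, 1024, 256])
```

The key design point this sketch captures is that each modality refines the other through attention (rather than a simple element-wise sum or concatenation), which is the behavior the abstract attributes to dual-modal attention in RCIF.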