Fusion-Mamba for Cross-Modality Object Detection

Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Guodong Guo, Baochang Zhang

Published: 01 Jan 2025, Last Modified: 05 Nov 2025. IEEE Transactions on Multimedia. License: CC BY-SA 4.0
Abstract: Cross-modality object detection aims to fuse complementary information from different modalities to improve model performance, enabling a wider range of applications. However, traditional cross-modality fusion methods based on CNNs or Transformers inadequately address pseudo-target information, which disperses the model's attention and degrades detection performance. In this paper, we investigate a novel cross-modality fusion approach that associates cross-modal features in a hidden state space, based on an improved Mamba with a gating attention mechanism. We propose the Fusion-Mamba Block (FMB), designed to map cross-modal features into a hidden state space for interaction, thereby refining the model's attention on true target areas and enhancing overall performance. The FMB comprises two key modules: the State Space Channel Swapping (SSCS) module, which facilitates the fusion of shallow features, and the Dual State Space Fusion (DSSF) module, which enables deep fusion and effectively suppresses pseudo-target information within the hidden state space. Our method outperforms state-of-the-art approaches, achieving improvements of 5.9%, 3.5%, and 2.1% mAP on $M^{3}$FD, DroneVehicle, and FLIR-Aligned, respectively. To the best of our knowledge, this work establishes a new baseline for cross-modality object detection, providing a robust foundation for future research in this area.
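To make the two-stage fusion structure described above concrete, the following is a minimal PyTorch sketch of how an FMB-style block could be organized. The module names (SSCS, DSSF, FMB) come from the abstract; all internal details are assumptions for illustration only: the channel-split swap, the sigmoid gate standing in for the gating attention mechanism, and the plain linear projections used in place of the actual Mamba state-space scan are not the authors' implementation.

```python
import torch
import torch.nn as nn


class SSCS(nn.Module):
    """State Space Channel Swapping (sketch): exchanges half of the
    channels between the two modality features, giving a cheap shallow
    cross-modal interaction before deeper fusion. The split point is an
    assumption, not taken from the paper."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_ir = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        # rgb, ir: (B, N, C) token sequences from the two modalities
        c = rgb.shape[-1] // 2
        swapped_rgb = torch.cat([rgb[..., :c], ir[..., c:]], dim=-1)
        swapped_ir = torch.cat([ir[..., :c], rgb[..., c:]], dim=-1)
        return self.norm_rgb(swapped_rgb), self.norm_ir(swapped_ir)


class DSSF(nn.Module):
    """Dual State Space Fusion (sketch): projects both modalities into a
    shared hidden space and blends them with a per-token sigmoid gate,
    so tokens that disagree across modalities (pseudo-targets) can be
    down-weighted. The real module uses a Mamba state-space scan, which
    is omitted here for simplicity."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_rgb = nn.Linear(dim, dim)
        self.proj_ir = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.out = nn.Linear(dim, dim)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        h_rgb, h_ir = self.proj_rgb(rgb), self.proj_ir(ir)
        g = self.gate(torch.cat([h_rgb, h_ir], dim=-1))  # per-token gate
        fused = g * h_rgb + (1.0 - g) * h_ir
        return self.out(fused)


class FusionMambaBlock(nn.Module):
    """Fusion-Mamba Block (sketch): SSCS for shallow fusion, followed by
    DSSF for deep fusion in the shared hidden space."""

    def __init__(self, dim: int):
        super().__init__()
        self.sscs = SSCS(dim)
        self.dssf = DSSF(dim)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        rgb, ir = self.sscs(rgb, ir)
        return self.dssf(rgb, ir)


if __name__ == "__main__":
    fmb = FusionMambaBlock(dim=64)
    rgb = torch.randn(2, 196, 64)  # e.g. a flattened 14x14 feature map
    ir = torch.randn(2, 196, 64)
    print(fmb(rgb, ir).shape)      # torch.Size([2, 196, 64])
```

The sigmoid gate is one simple way to realize the "gating attention mechanism" mentioned in the abstract: it produces a convex combination of the two modality features per token, so fully consistent tokens pass through while modality-specific pseudo-target responses can be suppressed.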