Abstract: Visible–infrared remote sensing object detection aims to achieve all-weather object detection by leveraging the complementary information in paired visible and infrared (RGB–IR) images. However, modality differences and weak alignment often limit its performance. Existing methods largely neglect the frequency discrepancies between modalities and require strict alignment, which increases complexity. To address these challenges, this study proposes a novel mask-guided frequency feature (MGFF) fusion method for RGB–IR object detection in remote sensing. Specifically, we develop a feature frequency decomposition and enhancement module (FDEM) based on the wavelet transform (WT), which reduces modality differences between RGB and IR images by restructuring and enhancing their frequency components. In addition, we introduce a mask-guided feature reconstruction (MGFR) module together with a feature-guided consistency loss, which keeps the focus on integrating target features from different modalities even under weak alignment; this loss also guides the reconstruction of features across modalities. Finally, we design a multidirectional perception cross-modality fusion module to achieve deep fusion of multimodal information, enhancing object perception from different directions across modalities. Extensive evaluations on the widely recognized RGB–IR remote sensing benchmarks DroneVehicle and VEDAI, as well as the RGB–IR pedestrian dataset KAIST, substantiate the effectiveness of the proposed MGFF method. The results consistently demonstrate that MGFF achieves superior detection accuracy and robustness compared to existing state-of-the-art approaches.
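To make the frequency-decomposition idea concrete, the following is a minimal sketch of wavelet-based RGB–IR fusion using PyWavelets. It is not the paper's FDEM (which is a learned module operating on deep features); the averaging of low-frequency bands, the max-selection of high-frequency bands, and the `hf_gain` enhancement factor are all illustrative assumptions.

```python
import numpy as np
import pywt  # PyWavelets


def frequency_decompose(img, wavelet="haar"):
    """Split a single-channel image into a low-frequency approximation
    and three high-frequency detail bands via a 2-D DWT."""
    ll, (lh, hl, hh) = pywt.dwt2(img, wavelet)
    return ll, (lh, hl, hh)


def fuse_frequencies(rgb, ir, wavelet="haar", hf_gain=1.5):
    """Heuristic stand-in for frequency restructuring and enhancement:
    average the low-frequency bands (shared scene structure) and keep the
    stronger, amplified high-frequency response (edges, textures) per pixel."""
    ll_r, hf_r = frequency_decompose(rgb, wavelet)
    ll_i, hf_i = frequency_decompose(ir, wavelet)
    ll = 0.5 * (ll_r + ll_i)
    hf = tuple(
        hf_gain * np.where(np.abs(a) >= np.abs(b), a, b)
        for a, b in zip(hf_r, hf_i)
    )
    # Reconstruct a fused image from the merged frequency components.
    return pywt.idwt2((ll, hf), wavelet)


rgb = np.random.rand(64, 64)
ir = np.random.rand(64, 64)
fused = fuse_frequencies(rgb, ir)
print(fused.shape)  # (64, 64)
```

The same decompose–merge–reconstruct pattern applies when the inputs are feature maps rather than raw images; in the paper the merging rule is learned rather than fixed.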
External IDs: dblp:journals/tgrs/ChenJZYZWHWCZ25