Multi-modal feature integration network for Visible-Depth-Thermal salient object detection

Published: 01 Jan 2025 · Last Modified: 02 Aug 2025 · Eng. Appl. Artif. Intell. 2025 · CC BY-SA 4.0
Abstract: In recent years, salient object detection in multi-modal scenarios has attracted increasing attention, since additional modalities can improve detection performance. However, although existing saliency models achieve encouraging performance, they overlook the imbalance in information content between the visible modality and the auxiliary modalities (i.e., depth and thermal), and they do not fully exploit multi-level features, leading to insufficient multi-modal fusion and multi-level integration. Therefore, in this paper, we propose a multi-modal feature integration network (MFINet) for Visible-Depth-Thermal (VDT) salient object detection (SOD), which contains three key modules. First, a three-modal feature fusion (TMFF) module enhances and fuses the multi-modal features by emphasizing effective feature channels and enlarging the receptive fields of the features, with additional emphasis on visible cues. Second, a neighborhood layer feature enhancement (NLFE) module exploits complementary information from adjacent TMFF modules to enhance the decoder features through different spatial attention strategies. Third, a multi-level cascade feature integration (MCFI) module aggregates the multi-level decoder features in a cascaded manner to produce the final high-quality saliency maps. Comprehensive experiments on the VDT-2048 dataset demonstrate that our model outperforms state-of-the-art models on all evaluation metrics. The code is available at https://github.com/banjamn/MFINet.
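To make the fusion idea concrete, below is a minimal PyTorch sketch of a three-modal fusion block in the spirit of the TMFF description above: channel attention emphasizes effective feature channels, a dilated convolution enlarges the receptive field, and a learnable scalar lets the visible branch dominate the fusion. This is not the authors' implementation (the class names ThreeModalFusion and ChannelAttention, the visible_weight parameter, and all hyperparameters are hypothetical); consult the linked repository for the actual MFINet code.

```python
# Hypothetical sketch of a TMFF-style fusion block; not the authors' code.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel reweighting."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global context per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)  # emphasize effective channels


class ThreeModalFusion(nn.Module):
    """Fuse visible, depth, and thermal features at one encoder level."""

    def __init__(self, channels):
        super().__init__()
        self.ca_visible = ChannelAttention(channels)
        self.ca_depth = ChannelAttention(channels)
        self.ca_thermal = ChannelAttention(channels)
        # Dilated convolution enlarges the receptive field of the fused map.
        self.dilated = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Learnable scalar so the visible modality can be weighted more
        # heavily than the auxiliary depth/thermal modalities (assumption).
        self.visible_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, f_visible, f_depth, f_thermal):
        fused = (self.visible_weight * self.ca_visible(f_visible)
                 + self.ca_depth(f_depth)
                 + self.ca_thermal(f_thermal))
        return self.dilated(fused)


if __name__ == "__main__":
    block = ThreeModalFusion(channels=64)
    f_v = torch.randn(1, 64, 56, 56)
    f_d = torch.randn(1, 64, 56, 56)
    f_t = torch.randn(1, 64, 56, 56)
    print(block(f_v, f_d, f_t).shape)  # torch.Size([1, 64, 56, 56])
```

In a full VDT-SOD pipeline, one such block would sit at each encoder level, and its outputs would feed the decoder; the abstract's NLFE and MCFI modules would then refine adjacent-level decoder features with spatial attention and aggregate them in a cascade, respectively.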