Lightweight three-stream encoder-decoder network for multi-modal salient object detection

Published: 01 Jan 2025 · Last Modified: 02 Aug 2025 · J. Vis. Commun. Image Represent. 2025 · CC BY-SA 4.0
Abstract: Salient object detection (SOD) identifies the most visually attractive objects in a scene. In recent years, multi-modal SOD has shown promising prospects. However, most existing multi-modal SOD models ignore model size and computational cost in pursuit of comprehensive cross-modality feature fusion. To make high-accuracy models more feasible in practical applications, we propose a Lightweight Three-stream Encoder–Decoder Network (TENet) for multi-modal salient object detection. Specifically, we design three decoders to explore saliency clues embedded in different multi-modal features and leverage a hierarchical decoding structure to alleviate the negative effects of low-quality images. To reduce the differences among modalities, we propose a lightweight modal information-guided fusion (MIGF) module to enhance the correlation between the RGB-D and RGB-T modalities, thus laying the groundwork for triple-modal fusion. Furthermore, to exploit multi-scale information, we propose a semantic interaction (SI) module and a semantic feature enhancement (SFE) module to integrate the specific hierarchical information embedded in high- and low-level features. Extensive experiments on the VDT-2048 dataset show that TENet has a model size of 37 MB and an inference speed of 38 FPS, while achieving accuracy comparable to 16 state-of-the-art multi-modal methods.