Abstract: Depth information for RGB-D Salient Object Detection(SOD) is important and conventional deep models are usually relied on the CNN feature extractors. The long-range contextual dependencies, dense modeling on the saliency decoder, and multi-task learning assistance are usually ignored. In this work, we propose a Dual Swin-Transformer-based Mutual Interactive Network (DTMINet), aiming to learn contextualized, dense, and edge-aware features for RGB-D SOD. We adopt the Swin-Transformer as the visual backbone to extract contextualized features. A self-attention-based Cross-Modality Interaction module is proposed to strengthen the visual backbone for cross-modal interaction. In addition, a Gated Modality Attention module is designed for cross-modal fusion. At different decoding stages, enhanced with dense connections and progressively merge the multi-level encoding features with the proposed Dense Saliency Decoder. Considering the depth quality issue, a Skip Convolution module is introduced to provide guidance to the RGB modality for the saliency prediction. In addition, we add the edge prediction to the saliency predictor to regularize the learning process. Comprehensive experiments on five standard RGB-D SOD benchmark datasets over four evaluation metrics demonstrate the superiority of the proposed method.
Loading