Abstract: Multi-modality images with complementary cues can significantly improve the performance of salient object detection (SOD) methods in challenging scenes. However, existing methods are generally designed specifically for either RGB-D or RGB-T SOD, so a unified SOD framework that can process various multi-modality images is needed to bridge this gap. To address this issue, we propose a unified multi-modality interaction fusion framework for RGB-D and RGB-T SOD, named UMINet. We investigate in depth the differences between appearance (RGB) images and complementary images and design an asymmetric backbone to extract appearance features and complementary cues. For the complementary-cue branch, a complementary information aware module (CIAM) is proposed to perceive and enhance the weights of complementary-modality features. We also propose a multi-modality difference fusion (MDF) block to fuse cross-modality features; the MDF block simultaneously considers the difference and consistency between appearance and complementary features. Furthermore, to capture rich contextual dependencies and integrate cross-level multi-modality features, we design a mutual refinement decoder (MRD) that progressively predicts saliency results. The MRD consists of three reverse perception blocks (RPBs) and five sub-decoders. Extensive experiments on six RGB-D SOD datasets and three RGB-T SOD datasets demonstrate that the proposed UMINet achieves substantial improvements over existing state-of-the-art (SOTA) models.
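The abstract only names the framework's components; the following is a minimal, hypothetical PyTorch sketch of how such components might compose (asymmetric two-stream encoder, CIAM re-weighting, difference/consistency fusion in the MDF block, and a placeholder decoder standing in for the MRD). The module names follow the abstract, but every internal layer and signature here is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the data flow described in the abstract (not the authors' code).
import torch
import torch.nn as nn

class CIAM(nn.Module):
    """Assumed CIAM: re-weights complementary-modality (depth/thermal) features."""
    def __init__(self, c):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(c, c, 1), nn.Sigmoid())

    def forward(self, comp):
        return comp * self.gate(comp)  # channel-wise attention weighting

class MDF(nn.Module):
    """Assumed MDF: fuses appearance and complementary features via difference and consistency cues."""
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, rgb, comp):
        diff = torch.abs(rgb - comp)   # modality difference cue
        cons = rgb * comp              # modality consistency cue
        return self.fuse(torch.cat([diff, cons], dim=1))

class UMINetSketch(nn.Module):
    def __init__(self, c=32):
        super().__init__()
        # Asymmetric streams: a deeper appearance (RGB) encoder, a lighter
        # encoder for the complementary modality (depth or thermal).
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, c, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        self.comp_enc = nn.Sequential(nn.Conv2d(3, c, 3, padding=1), nn.ReLU())
        self.ciam = CIAM(c)
        self.mdf = MDF(c)
        self.decoder = nn.Conv2d(c, 1, 3, padding=1)  # stand-in for the MRD

    def forward(self, rgb, comp):
        f_rgb = self.rgb_enc(rgb)
        f_comp = self.ciam(self.comp_enc(comp))
        fused = self.mdf(f_rgb, f_comp)
        return torch.sigmoid(self.decoder(fused))     # saliency map

if __name__ == "__main__":
    net = UMINetSketch()
    rgb = torch.randn(1, 3, 224, 224)
    comp = torch.randn(1, 3, 224, 224)  # depth/thermal replicated to 3 channels
    print(net(rgb, comp).shape)          # torch.Size([1, 1, 224, 224])
```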