Masked Visual Pre-training for RGB-D and RGB-T Salient Object Detection

Published: 01 Jan 2024 · Last Modified: 14 May 2025 · PRCV (5) 2024 · CC BY-SA 4.0
Abstract: Recent advances in deep learning have boosted the performance of RGB-D salient object detection (SOD), but at the cost of relying on weights pre-trained on large-scale labeled RGB datasets. To avoid labor-intensive labeling, RGB-D self-supervised learning based on mutual prediction has been proposed to pre-train networks for the RGB-D SOD task. However, its two-stream design is cumbersome and far from optimal when transferred to the downstream task. In this paper, we present a simple and effective masked self-supervised pre-training scheme for RGB-D SOD. Specifically, we develop a single-stream encoder-decoder framework, in which the encoder operates only on the sampled RGB-D patches and a joint decoder reconstructs the original RGB images and depth maps simultaneously. This self-supervised pre-training drives our model to learn both uni-modal representations and cross-modal synergies, thereby providing a strong initialization for the downstream task. Moreover, we design a mutually exclusive spatial (MES) sampling strategy that samples RGB and depth patches sharing no spatial overlap, allowing the encoder to establish richer cross-modal relationships across more spatial locations. Extensive experiments on six benchmarks show that our approach surpasses previous self-supervised learning methods by large margins and performs favorably against most state-of-the-art models pre-trained on ImageNet. In addition, our model exhibits high robustness on degraded images and transfers well to RGB-T benchmarks.
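The core of the MES strategy, as described above, is that the visible RGB patches and visible depth patches are drawn from disjoint spatial positions, forcing the encoder to relate the two modalities across locations. Below is a minimal sketch of that idea, not the authors' code: the function name, the 75% mask ratio, and the even split between modalities are illustrative assumptions.

```python
import torch


def mes_sample(num_patches: int, mask_ratio: float = 0.75):
    """Mutually exclusive spatial (MES) sampling sketch.

    Splits the visible patch positions disjointly between the RGB and
    depth streams, so the two modalities never see the same location.
    Returns (rgb_idx, depth_idx) index tensors with no spatial overlap.
    """
    num_visible = int(num_patches * (1.0 - mask_ratio))
    perm = torch.randperm(num_patches)   # random spatial order over all positions
    visible = perm[:num_visible]         # positions kept visible overall
    half = num_visible // 2              # even split is an assumption here
    rgb_idx = visible[:half]             # RGB patches come from these positions...
    depth_idx = visible[half:]           # ...depth patches from the remaining ones
    return rgb_idx, depth_idx


# Example: a 14x14 patch grid (196 tokens), as for a ViT-B/16 on 224x224 input.
rgb_idx, depth_idx = mes_sample(196)
assert not set(rgb_idx.tolist()) & set(depth_idx.tolist())  # disjoint by construction
```

In a pre-training step, the encoder would embed only the patches selected by these two index sets, and the joint decoder would reconstruct the full RGB image and depth map from them, as the abstract describes.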