Abstract: Segmenting the most prominent objects in a scene from a pair of color and depth images requires the model to learn effective multimodal fusion. Despite the rapidly growing number of recent studies, a significant problem remains underestimated: datasets are labeled according to annotators' subjective judgments and therefore lack consistency in identifying the most prominent objects, while a single image can contain several valid sets of salient objects. To tackle this issue, we propose a multi-ground-truth approach for RGB-D Saliency Detection (dubbed S-MultiMAE) that combines multi-perspective tokens, which guide the model to produce diverse yet desirable predictions, with a masked-autoencoding pretraining task (inherited from MultiMAE) that yields superior multimodal fusion of color and depth images. We conducted extensive analyses on both multi- and single-ground-truth benchmarks on the COME15K dataset to demonstrate the effectiveness of our proposed method. The source code is available at https://github.com/thinh-re/s-multimae.
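To make the multi-ground-truth idea concrete, the sketch below shows one way a learned "perspective" token could condition a saliency decoder so that the same fused RGB-D features yield different predictions. This is a minimal illustrative assumption, not the authors' implementation; the class, names, and shapes (MultiGTSaliencyHead, perspective_tokens, feat_dim, num_perspectives) are hypothetical.

```python
# Minimal sketch (illustrative, not the S-MultiMAE code): condition a saliency
# decoder on a learned "perspective" token so one RGB-D feature map can produce
# several plausible ground-truth predictions. All names/shapes are assumptions.
import torch
import torch.nn as nn

class MultiGTSaliencyHead(nn.Module):
    def __init__(self, feat_dim: int = 256, num_perspectives: int = 5):
        super().__init__()
        # One learnable token per annotation "perspective" (hypothetical count).
        self.perspective_tokens = nn.Embedding(num_perspectives, feat_dim)
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim // 2, 1, kernel_size=1),  # 1-channel saliency logits
        )

    def forward(self, fused_feats: torch.Tensor, perspective_id: torch.Tensor):
        # fused_feats: (B, C, H, W) features from a MultiMAE-style RGB-D encoder.
        # perspective_id: (B,) integer index selecting which ground truth to mimic.
        token = self.perspective_tokens(perspective_id)       # (B, C)
        conditioned = fused_feats + token[:, :, None, None]   # broadcast over H, W
        return self.decoder(conditioned)                      # (B, 1, H, W) logits

# Usage: the same image pair gives a different saliency map per perspective id.
head = MultiGTSaliencyHead()
feats = torch.randn(2, 256, 28, 28)
maps_a = head(feats, torch.tensor([0, 0]))
maps_b = head(feats, torch.tensor([3, 3]))
```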