Abstract: Salient object detection (SOD) is a critical task in computer vision that aims to identify visually striking regions within images. Existing SOD methods predict saliency maps in a supervised manner and thus rely heavily on labels. These methods face two challenges: (1) poor boundary detection when salient objects closely resemble the background; (2) high false-positive rates caused by focusing on the objects while neglecting their surroundings. It is therefore crucial to develop solutions that improve both the comprehensiveness and the precision of SOD results. Inspired by a pilot study showing that supervised learning tends to focus on prominent regions while neglecting background information around objects, whereas self-supervised learning captures more comprehensive details, we introduce self-supervised contrastive learning into the SOD framework. We design image-level and pixel-level contrastive learning for SOD models with Token-to-Token Vision Transformer (T2T) and Vision Graph Neural Network (ViG) backbones. Moreover, our approach is backbone-agnostic and can be applied as a plugin to any model. We conduct comprehensive comparison and ablation experiments on both RGB natural-image datasets and medical-image datasets; the results demonstrate that our method consistently outperforms state-of-the-art methods. Most importantly, our method not only provides a new perspective on the SOD task but also suggests a new paradigm for other dense prediction tasks. Code is available at https://github.com/msctransu/SCL_SOD.git.
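The contrastive learning the abstract refers to is typically built on an InfoNCE-style objective: an anchor embedding is pulled toward a positive (e.g., another view of the same image, or a pixel of the same region) and pushed away from negatives. The sketch below is a minimal, generic illustration of that objective with scalar similarities; the function name, interface, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import math

def info_nce(sim_pos, sims_neg, temperature=0.07):
    """Generic InfoNCE loss for a single anchor (illustrative sketch).

    sim_pos:  similarity between the anchor and its positive sample
              (e.g., cosine similarity of two views of the same image,
              or two pixel embeddings from the same region).
    sims_neg: similarities between the anchor and negative samples.

    Loss = -log( exp(s+/t) / (exp(s+/t) + sum_i exp(s_i-/t)) ).
    """
    pos = math.exp(sim_pos / temperature)
    den = pos + sum(math.exp(s / temperature) for s in sims_neg)
    return -math.log(pos / den)
```

The same formula serves both image-level contrast (whole-image embeddings as anchors) and pixel-level contrast (per-pixel embeddings as anchors); only the choice of positives and negatives changes.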