Language-Guided Semantic Alignment for Co-saliency Detection

Published: 01 Jan 2024, Last Modified: 05 Nov 2025 · ICME 2024 · CC BY-SA 4.0
Abstract: Previous pure-vision paradigms for co-saliency detection (COD) predominantly employ supervised training, where the supervisory signals consist of binary masks or a combination of masks and category labels. However, constrained by limited training samples, these models often suffer from overfitting and struggle to generalize to unseen samples. To this end, this paper presents contrastive language-image pre-training COD (CLIP-COD), a novel language-guided semantic alignment paradigm for COD. The primary objective is to leverage CLIP to align concepts between language and images; this alignment exploits the powerful language understanding capability of CLIP and transfers its knowledge to the image domain, thereby enhancing the model’s zero-shot generalization ability for COD. First, we propose a semantic alignment branch (SAB) that learns rich knowledge for comprehending images globally. Meanwhile, the SAB narrows the gap between language and image features in the high-dimensional feature space, transferring the powerful semantic knowledge of CLIP to our model. Subsequently, we devise an intra-group multi-fusion module (IMM) to capture features that integrate group knowledge as dense prompts, providing spatial localization information for subsequent fine segmentation. Finally, we feed sparse language prompts and dense mask cues into the pre-trained SAM decoder to obtain the final COD results. Additionally, we design a transfer optimization adaptor, which reduces the model’s trainable scale and greatly saves computing resources and cost. Extensive experiments on three benchmark datasets, CoSal2015, CoCA, and CoSOD3k, demonstrate the superior performance of our CLIP-COD over a variety of state-of-the-art methods.
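
The abstract describes two core components, the semantic alignment branch (SAB) and the intra-group multi-fusion module (IMM), only at a high level. Below is a minimal, illustrative PyTorch sketch of how language-image alignment and intra-group fusion of this kind might be wired together. All class names, dimensions, the cosine-alignment loss, and the attention-based fusion are assumptions made for illustration, not the authors' implementation; the CLIP encoders and SAM decoder are represented by random stand-in tensors.

```python
# Illustrative sketch only: module names, dimensions, and interfaces are
# placeholders assumed for this example, not taken from the CLIP-COD paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticAlignmentBranch(nn.Module):
    """Aligns image features with CLIP text embeddings (SAB-like, illustrative)."""

    def __init__(self, img_dim=768, txt_dim=512, proj_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, proj_dim)
        self.txt_proj = nn.Linear(txt_dim, proj_dim)

    def forward(self, img_feats, txt_feats):
        # Project both modalities into a shared space and align by cosine similarity.
        v = F.normalize(self.img_proj(img_feats), dim=-1)  # (N, D)
        t = F.normalize(self.txt_proj(txt_feats), dim=-1)  # (N, D)
        align_loss = (1.0 - (v * t).sum(dim=-1)).mean()
        return v, align_loss


class IntraGroupMultiFusion(nn.Module):
    """Fuses token features across a group of images into dense prompts (IMM-like, illustrative)."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, group_feats):
        # group_feats: (N, L, D) token features for N images belonging to one group.
        # Each image attends over the concatenated group context so that the
        # fused tokens carry group-level localization cues.
        n, l, d = group_feats.shape
        context = group_feats.reshape(1, n * l, d).expand(n, -1, -1)
        fused, _ = self.attn(group_feats, context, context)
        return fused  # dense prompts for a downstream mask decoder


# Example forward pass with random stand-ins for frozen CLIP features.
sab = SemanticAlignmentBranch()
imm = IntraGroupMultiFusion()
img_feats = torch.randn(4, 768)   # pooled image features (e.g., from a CLIP image encoder)
txt_feats = torch.randn(4, 512)   # text embeddings (e.g., from the CLIP text encoder)
aligned, align_loss = sab(img_feats, txt_feats)
dense_prompts = imm(torch.randn(4, 196, 512))
# In the full pipeline, sparse language prompts and these dense cues would be
# passed to a pre-trained SAM decoder (not shown) to produce the final maps.
```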