Learning Joint Appearance and Shape Co-Representations for Co-Saliency Detection

Published: 01 Jan 2025, Last Modified: 05 Nov 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Existing leading Co-saliency Detection (CoD) frameworks aim to segment co-salient objects by learning a consensus visual representation of the foreground objects. However, some distractors, despite belonging to different categories, may appear similar to the co-salient objects; for example, apples and bananas can share similar colors and textures. This makes it difficult to distinguish such distractors by learning co-salient object appearance representations alone. To address this issue, we propose a joint appearance and shape co-representation learner for CoD, dubbed ASCoD. ASCoD is composed of a Co-Appearance learning Module (CoAM) and a Co-Shape learning Module (CoSM). The CoAM first learns a co-salient object appearance embedding that encodes global cross-image and spatial context information. This embedding then serves as a co-appearance prototype that guides the model to enhance the features and highlight the co-salient object regions. Afterwards, we design the CoSM as a cross-attention module in which the keys and values encode shape information from a set of salient tokens dynamically selected by a Co-Shape Prototype generation Module (CSPM). Finally, by jointly optimizing the cascaded CoAM and CoSM, the optimal appearance and shape co-representations are obtained, marrying the merits of both: they are robust to appearance variations of the co-salient objects and can also discriminate the co-salient objects from distractors with similar appearance. Extensive evaluations on three challenging benchmarks, CoCA, CoSOD3k, and CoSal2015, demonstrate the superiority of ASCoD over a variety of state-of-the-art CoD methods.
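
To make the CoSM mechanism described above concrete, the following is a minimal sketch (not the authors' implementation) of a cross-attention module whose keys and values come from a small set of salient tokens selected by scoring features against a shape prototype. All names and hyperparameters here (e.g., `top_k`, `embed_dim`, the single-vector prototype) are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a CoSM-style cross-attention: queries are the
# appearance-enhanced feature tokens; keys/values are the top-k tokens
# selected by similarity to an assumed co-shape prototype vector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoShapeCrossAttention(nn.Module):
    def __init__(self, embed_dim: int = 256, num_heads: int = 8, top_k: int = 64):
        super().__init__()
        self.top_k = top_k
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, feats: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, C) flattened spatial tokens from appearance-enhanced features
        # prototype: (B, C) an assumed single-vector co-shape prototype
        scores = torch.einsum("bnc,bc->bn",
                              F.normalize(feats, dim=-1),
                              F.normalize(prototype, dim=-1))       # token saliency scores
        idx = scores.topk(self.top_k, dim=1).indices                # dynamically select top-k tokens
        kv = torch.gather(feats, 1,
                          idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        out, _ = self.attn(query=feats, key=kv, value=kv)           # cross-attend to selected tokens
        return self.norm(feats + out)                               # residual connection + norm


if __name__ == "__main__":
    b, n, c = 2, 1024, 256
    module = CoShapeCrossAttention(embed_dim=c)
    refined = module(torch.randn(b, n, c), torch.randn(b, c))
    print(refined.shape)  # torch.Size([2, 1024, 256])
```

This sketch only illustrates the general idea of restricting keys and values to a dynamically selected token set; how the CSPM actually scores and selects tokens, and how shape information is encoded, is specific to the paper.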