Abstract: Existing leading deep learning-based Co-saliency Detection (CoD) methods often learn consensus features from the input image group without considering the complexity of each image. Despite their demonstrated success, the input group may contain hard samples of high complexity, e.g., images containing distractors that are similar in appearance to the co-salient objects but differ in semantics. Such distractors are prone to mislead the learned model into treating them as co-salient objects, causing classification ambiguity. To address this issue, this paper presents an easy-to-hard instance-level feature Fusion framework for CoD, termed E2HCoD. E2HCoD exploits the instance-level co-salient object consensus cues from the easy samples as reliable guidance for accurately fusing the co-salient object features in the hard samples. First, we design a Feature Filtering Module (FFM) that evaluates image complexity by integrating entropy, variance, texture, and edge-density cues, allowing the model to select the easy samples with relatively simple backgrounds. Second, we develop an Easy-instance Embedding Branch (EEB), which accurately segments the co-salient object masks from the easy samples as instance-level guidance for learning accurate co-salient object consensus cues. Third, with the consensus knowledge from the easy samples as guidance, we construct an Easy-instance guided Fusion Branch (EFB), which lets this knowledge fully interact with the features of the hard samples via a cross-attention mechanism, yielding refined features that highlight the co-salient objects while suppressing the distractors. Finally, the refined features are fed into the decoder to generate high-quality CoD predictions. Extensive experiments demonstrate that the proposed E2HCoD achieves state-of-the-art performance on CoSal2015, CoCA, and CoSOD3k.
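To illustrate the kind of complexity scoring the FFM describes, the following is a minimal sketch, not the paper's implementation: it combines the four named cues (entropy, variance, texture, edge density) into a scalar score and splits the group into easy and hard samples. The Laplacian-variance texture measure, Sobel edge density, equal cue weights, threshold, and 50/50 split ratio are all illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def complexity_score(gray, weights=(0.25, 0.25, 0.25, 0.25)):
    """Score image complexity from entropy, variance, texture, and
    edge-density cues (cue estimators and weights are assumptions)."""
    g = gray.astype(np.float64) / 255.0

    # Shannon entropy of the intensity histogram, normalized by 8 bits.
    hist, _ = np.histogram(g, bins=256, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum() / 8.0

    # Global intensity variance.
    variance = g.var()

    # Texture cue: variance of the Laplacian response (an assumption).
    texture = ndimage.laplace(g).var()

    # Edge density: fraction of pixels with a strong Sobel gradient.
    gx, gy = ndimage.sobel(g, axis=1), ndimage.sobel(g, axis=0)
    mag = np.hypot(gx, gy)
    edge_density = (mag > 0.3 * mag.max()).mean() if mag.max() > 0 else 0.0

    cues = np.array([entropy, variance, texture, edge_density])
    return float(np.dot(weights, cues))

def split_easy_hard(gray_images, ratio=0.5):
    """Rank a group by complexity; lowest-scoring images are 'easy'."""
    scores = [complexity_score(im) for im in gray_images]
    order = np.argsort(scores)
    k = max(1, int(len(gray_images) * ratio))
    return order[:k], order[k:]  # indices of easy and hard samples
```

Lower scores correspond to simpler backgrounds, so the lowest-ranked images play the role of the easy samples fed to the EEB.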
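The cross-attention interaction in the EFB can likewise be sketched. The following PyTorch module is a minimal, assumed design, not the paper's exact EFB: hard-sample features act as queries over instance-level consensus tokens extracted from the easy samples, with a residual connection producing the refined feature map. The feature dimension, head count, and use of nn.MultiheadAttention are assumptions.

```python
import torch
import torch.nn as nn

class EasyGuidedFusion(nn.Module):
    """Sketch of easy-instance guided fusion via cross-attention:
    hard-sample features (queries) attend to easy-sample consensus
    tokens (keys/values). Shapes and layer choices are assumptions."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hard_feat, consensus):
        # hard_feat: (B, C, H, W) features of a hard sample
        # consensus: (B, N, C) instance-level consensus tokens (easy samples)
        B, C, H, W = hard_feat.shape
        q = hard_feat.flatten(2).transpose(1, 2)       # (B, H*W, C) queries
        fused, _ = self.attn(q, consensus, consensus)  # attend to easy cues
        q = self.norm(q + fused)                       # residual refinement
        return q.transpose(1, 2).view(B, C, H, W)      # back to feature map

# Usage: refined = EasyGuidedFusion(dim=256)(hard_feat, consensus_tokens)
```

Under this reading, attention weights concentrate on consensus tokens that match genuine co-salient regions, so distractor responses in the hard-sample features are suppressed before the decoder produces the final prediction.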