CollabLearn: Propelling Weakly-Supervised Referring Image Segmentation Through Collaboration Between Semantics and Details

Chao Jiang, Yuqiu Kong, Mengnan Zhao, Lihe Zhang, Baocai Yin

Published: 2025, Last Modified: 28 Feb 2026, Venue: IEEE Trans. Multim. 2025, License: CC BY-SA 4.0
Abstract: This work presents CollabLearn, a weakly supervised referring image segmentation method that segments objects described by free-form referring expressions using only image-text pairs. Existing methods either mislocalize the referred object because cross-modal alignment lacks high-level semantics, or segment it coarsely because low-level details are absent. To address these issues, we propose a framework that generates cross-modal features encompassing both high-level semantics and low-level details via two fusion modules: a semantic awareness module and a detail cognition module. Each module generates an activation map, and the two maps correct each other through a collaborative learning strategy. Specifically, the semantic awareness module performs in-depth cross-modal interaction and achieves accurate localization in a top-down manner, while the detail cognition module facilitates the segmentation of entire objects in a bottom-up manner. The collaborative learning strategy enables interaction between these two modules, enforcing sufficient vision-language alignment. Experiments on three benchmarks demonstrate that CollabLearn consistently outperforms state-of-the-art weakly supervised methods.