Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: referring image segmentation, weakly-supervised learning, computer vision
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a three-stage framework that performs referring image segmentation without mask supervision, significantly outperforms existing zero-shot and weakly-supervised baselines, and approaches the performance of the fully-supervised state of the art.
Abstract: Referring Image Segmentation (RIS), the problem of identifying objects in images through natural language sentences, is a challenging task currently solved mostly through supervised learning. However, while collecting referred annotation masks is a time-consuming process, the few existing weakly-supervised and zero-shot approaches fall significantly short of fully-supervised methods in performance. To bridge the performance gap without mask annotations, we propose a novel weakly-supervised framework that tackles RIS by decomposing it into three steps: obtaining instance masks for the object mentioned in the referring instruction (*segment*), using zero-shot learning to select a potentially correct mask for the given instruction (*select*), and bootstrapping a model that can fix the mistakes of zero-shot selection (*correct*). In our experiments, using only the first two steps (zero-shot segment and select) outperforms other zero-shot baselines by as much as $19\%$, while our full method improves upon this much stronger baseline and sets a new state of the art for weakly-supervised RIS, reducing the gap between weakly-supervised and fully-supervised methods in some cases from around $33\%$ to as little as $14\%$.
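The segment-select-correct decomposition described in the abstract can be sketched as a minimal pipeline. The data layout and scoring function below are hypothetical stand-ins for illustration only (the paper's actual segmenter and zero-shot selector are not specified here); a real system would plug in an off-the-shelf instance segmenter for *segment* and a vision-language similarity model for *select*:

```python
def zero_shot_select(mask_proposals, expression, score_fn):
    """Stage 2 (*select*): score every candidate instance mask against the
    referring expression with a zero-shot scoring function and keep the
    highest-scoring mask as a pseudo-label.

    `mask_proposals` is assumed to come from Stage 1 (*segment*), e.g. an
    off-the-shelf instance segmenter. Stage 3 (*correct*) would then train
    a segmentation model on these pseudo-labels, letting it override noisy
    zero-shot selections.
    """
    scores = [score_fn(mask, expression) for mask in mask_proposals]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def toy_score(mask, expression):
    # Hypothetical stand-in for a region-text similarity model:
    # count word overlap between a mask's text label and the expression.
    return float(len(set(mask["label"].split()) & set(expression.split())))


proposals = [
    {"label": "dog on left", "mask": None},
    {"label": "cat on sofa", "mask": None},
]
idx, _ = zero_shot_select(proposals, "the cat sitting on the sofa", toy_score)
# idx == 1: the second proposal best matches the expression
```

The point of the sketch is the interface, not the models: any segmenter and any zero-shot scorer can be swapped in without changing the pipeline's structure.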
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5796