Abstract: Learning to segment images purely from image-text alignment on web data can lead to sub-optimal performance due to noise in the training data. The noise comes from samples whose associated text does not describe, or only partially describes, the image's visual content. To address this, this work proposes a novel loss function, termed SimCon, which compares an image jointly to other images and texts while accounting for intra-modal similarities to determine the appropriate set of semantic positives. Further, combining the SimCon loss with multiple synthetically created views of each image makes training more robust; this multi-view version of the loss is termed MV-SimCon. The empirical results demonstrate that the proposed loss function leads to consistent improvements on zero-shot, text-supervised semantic segmentation and outperforms the state of the art by $+3.0\%$, $+3.3\%$, and $+6.9\%$ on PASCAL VOC, PASCAL Context, and MSCOCO, respectively. With test-time augmentations, these results improve further to $58.7\%$, $26.6\%$, and $33.3\%$ on PASCAL VOC, PASCAL Context, and MSCOCO, respectively, setting a new state of the art. In addition, the proposed loss function leads to more robust training and faster convergence.
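The sketch below illustrates one plausible reading of a SimCon-style objective, under assumptions not spelled out in the abstract: each image is contrasted against all texts and all other images in the batch, and intra-modal (image-image) similarity above a hypothetical threshold `pos_threshold` marks an image as a semantic positive in addition to its paired text. Function and parameter names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def simcon_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                temperature: float = 0.07, pos_threshold: float = 0.8) -> torch.Tensor:
    """Hypothetical SimCon-style loss.

    img_emb, txt_emb: (B, D) embeddings of paired images and texts.
    Candidates for each image are all texts and all other images in the batch;
    positives are the paired text plus images whose intra-modal similarity
    exceeds `pos_threshold` (assumed mechanism).
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    B = img.size(0)

    # Cross-modal and intra-modal similarities, temperature-scaled.
    sim_it = img @ txt.t() / temperature          # image-to-text, (B, B)
    sim_ii = img @ img.t() / temperature          # image-to-image, (B, B)
    logits = torch.cat([sim_it, sim_ii], dim=1)   # candidate set, (B, 2B)

    # Positive mask: the paired text is always positive; images above the
    # intra-modal similarity threshold are treated as semantic positives.
    pos_txt = torch.eye(B, dtype=torch.bool, device=img.device)
    pos_img = (img @ img.t()) > pos_threshold
    pos_img.fill_diagonal_(False)                 # exclude self-similarity
    pos_mask = torch.cat([pos_txt, pos_img], dim=1)

    # Remove the image-to-self logit from numerator and denominator.
    self_mask = torch.cat(
        [torch.zeros_like(pos_txt), torch.eye(B, dtype=torch.bool, device=img.device)], dim=1)
    logits = logits.masked_fill(self_mask, float('-inf'))

    # Supervised-contrastive style objective: mean log-likelihood of positives.
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```

For the multi-view variant (MV-SimCon), the abstract suggests the same objective would be applied over synthetically created views of each image, e.g., by averaging the loss across views; the exact aggregation is not specified here.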
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jia-Bin_Huang1
Submission Number: 1935