SwinZS3: Zero-Shot Semantic Segmentation with a Swin Transformer

Published: 01 Feb 2023, Last Modified: 13 Feb 2023
Submitted to ICLR 2023
Readers: Everyone
Keywords: zero shot semantic segmentation, deep learning, transformer
Abstract: Zero-shot semantic segmentation (ZS3) aims to classify never-seen classes with zero training samples. Convolutional neural networks (CNNs) have recently achieved great success in this task. However, their limited attention ability constrains how well existing network architectures can reason based on word embeddings. In light of the recent successes achieved by Swin Transformers, we propose SwinZS3, a new framework that exploits visual embeddings and semantic embeddings in a joint embedding space. SwinZS3 combines a transformer image encoder with a language encoder. The image encoder is trained with pixel-text score maps using dense language-guided semantic prototypes, which are computed by the language encoder. This allows SwinZS3 to recognize unseen classes at test time without retraining. We evaluate our method on the standard ZS3 benchmarks (PASCAL VOC and PASCAL Context), and the results demonstrate its effectiveness by showing state-of-the-art performance.
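The pixel-text score map idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the score is a plain cosine similarity between each pixel embedding and each class prototype, and the function names, the toy embeddings, and the class names are all hypothetical stand-ins for the outputs of the image and language encoders.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def pixel_text_score_map(pixel_embeddings, prototypes):
    """Cosine similarity between every pixel embedding and every class prototype.

    pixel_embeddings: H x W nested list of D-dim vectors
        (hypothetical stand-in for the image encoder output).
    prototypes: dict mapping class name -> D-dim vector
        (hypothetical stand-in for the language encoder output).
    Returns an H x W nested list of {class name: score} dicts.
    """
    unit_protos = {name: normalize(vec) for name, vec in prototypes.items()}
    score_map = []
    for row in pixel_embeddings:
        score_row = []
        for pixel in row:
            unit_pixel = normalize(pixel)
            score_row.append({
                name: sum(p * q for p, q in zip(unit_pixel, proto))
                for name, proto in unit_protos.items()
            })
        score_map.append(score_row)
    return score_map

def segment(pixel_embeddings, prototypes):
    """Label each pixel with the class whose prototype scores highest."""
    scores = pixel_text_score_map(pixel_embeddings, prototypes)
    return [[max(s, key=s.get) for s in row] for row in scores]

# Toy 2 x 2 "image" with 3-dim embeddings (made-up numbers, assumption).
pixels = [
    [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],
]
seen = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
# An unseen class is added at test time by supplying only its prototype.
all_classes = dict(seen, sofa=[0.0, 0.0, 1.0])

labels = segment(pixels, all_classes)
```

In this sketch, adding the hypothetical "sofa" prototype changes the prediction for the bottom-right pixel without touching the pixel embeddings, which mirrors the abstract's claim that unseen classes can be recognized at test time without retraining.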
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
Supplementary Material: zip
