CAT-Seg: Cost Aggregation for Open-vocabulary Semantic Segmentation

20 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: Open-vocabulary semantic segmentation
Abstract: In this paper, we reinterpret the challenge of open-vocabulary semantic segmentation, where each pixel in an image is labeled with a wide range of text descriptions, as a correspondence problem focusing on the optimal text matching for each pixel. Addressing the limitations of conventional region-to-text matching approaches, we introduce a novel framework, CAT-Seg, grounded in the principles of cost aggregation methods from visual correspondence tasks. This framework refines the initial matching scores between dense image and text embeddings, leveraging a Transformer-based module for cost aggregation that is further enhanced with embedding guidance. Notably, by operating on cosine similarity instead of manipulating embeddings directly, our approach enables end-to-end fine-tuning of the CLIP model for pixel-level tasks while yielding superior zero-shot capabilities. Empirical evaluations show that our method achieves state-of-the-art results across open-vocabulary benchmarks, with practical computational efficiency and robustness across various domains, underscoring its potential for a wide range of open-vocabulary semantic segmentation applications.
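As a rough illustration of the "initial matching scores" the abstract refers to, the sketch below builds a cost volume of cosine similarities between per-pixel image embeddings and per-class text embeddings, then assigns each pixel its best-matching text. It is a minimal toy under stated assumptions, not the CAT-Seg implementation: the function names (`l2_normalize`, `cost_volume`, `label_map`) and the nested-list layout are hypothetical, the embeddings are hand-written rather than produced by CLIP, and the Transformer-based aggregation step that would refine the volume before the argmax is omitted entirely.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cost_volume(pixel_embeddings, text_embeddings):
    """Cosine similarity between every pixel embedding and every text embedding.

    pixel_embeddings: H x W x D nested lists (hypothetical dense image features).
    text_embeddings:  N x D nested lists (hypothetical class-name features).
    Returns an H x W x N nested list of similarities in [-1, 1].
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    volume = []
    for row in pixel_embeddings:
        out_row = []
        for pixel in row:
            p = l2_normalize(pixel)
            out_row.append([sum(a * b for a, b in zip(p, t)) for t in texts])
        volume.append(out_row)
    return volume


def label_map(volume):
    """Pick, for each pixel, the index of the text with the highest score."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in row]
        for row in volume
    ]


# Toy example: a 1 x 2 "image" with 2-dimensional embeddings and two classes.
pixels = [[[2.0, 0.1], [0.2, 3.0]]]
texts = [[1.0, 0.0], [0.0, 1.0]]
labels = label_map(cost_volume(pixels, texts))
```

Here `labels` is an H x W grid of class indices. In the framework the abstract describes, the cost volume would be refined by the aggregation module before labels are read off; this sketch skips that step and takes the argmax of the raw scores.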
Supplementary Material: pdf
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2373