Visual Category Discovery via Linguistic Anchoring

18 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: multi-modal, clustering, category discovery
Abstract: We address the problem of generalized category discovery (GCD), which aims to classify all images in a partially labeled image collection when the total number of target classes is unknown. Motivated by the relevance of visual categories to linguistic semantics, we propose language-anchored contrastive learning for GCD. Assuming consistent relations between images and their corresponding texts in an image-text joint embedding space, our method incorporates image-text consistency constraints into contrastive learning. To perform this process without manual image-text annotations, we assign each image a corresponding text embedding by retrieving its $k$-nearest-neighbor words from a random corpus of diverse words and aggregating them through cross-attention. The proposed method achieves state-of-the-art performance on the standard benchmarks ImageNet100, CUB, Stanford Cars, and Herbarium19.
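The sketch below is a rough, hedged illustration of the text-anchoring step described in the abstract, assuming a CLIP-style joint embedding space: for each image embedding, the $k$ nearest word embeddings are retrieved from a large word corpus and aggregated into a single text anchor via cross-attention with the image embedding as the query. All function and variable names (`retrieve_knn_words`, `aggregate_by_cross_attention`, the simplified parameter-free attention) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of k-NN word retrieval + cross-attention aggregation
# in a CLIP-like joint embedding space (not the authors' code).
import torch
import torch.nn.functional as F


def retrieve_knn_words(img_emb, word_emb, k=16):
    """Return the k nearest word embeddings (by cosine similarity) per image.

    img_emb:  (B, D) image embeddings in the joint space
    word_emb: (V, D) embeddings of a large, diverse word corpus
    """
    img_n = F.normalize(img_emb, dim=-1)
    word_n = F.normalize(word_emb, dim=-1)
    sim = img_n @ word_n.t()                      # (B, V) cosine similarities
    topk_idx = sim.topk(k, dim=-1).indices        # (B, k) indices of nearest words
    return word_emb[topk_idx]                     # (B, k, D) retrieved word embeddings


def aggregate_by_cross_attention(img_emb, knn_words, temperature=0.07):
    """Parameter-free single-head cross-attention: the image embedding acts as
    the query and the retrieved word embeddings as keys/values, yielding one
    aggregated text anchor per image."""
    q = F.normalize(img_emb, dim=-1).unsqueeze(1)                        # (B, 1, D)
    k = F.normalize(knn_words, dim=-1)                                   # (B, k, D)
    attn = torch.softmax((q @ k.transpose(1, 2)) / temperature, dim=-1)  # (B, 1, k)
    return (attn @ knn_words).squeeze(1)                                 # (B, D)


if __name__ == "__main__":
    B, V, D = 8, 10_000, 512          # batch size, vocabulary size, embedding dim
    img_emb = torch.randn(B, D)       # stand-in for image embeddings
    word_emb = torch.randn(V, D)      # stand-in for word (text) embeddings
    anchors = aggregate_by_cross_attention(img_emb, retrieve_knn_words(img_emb, word_emb))
    print(anchors.shape)              # torch.Size([8, 512])
```

The resulting text anchors could then serve as the per-image text embeddings in the image-text consistency term of the contrastive objective; the actual attention module and loss in the paper may be learned and differ from this parameter-free sketch.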
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1136