Conditional variational autoencoder for vision-language models

20 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Vision-language, domain generalization, CLIP
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish connections between text and images and achieve remarkable success on various downstream tasks after fine-tuning. Existing fine-tuning methods match a class-specific text description against the whole image. We recognize that this whole-image matching is not effective: images from the same class often contain different sets of semantic objects, each object further consists of a set of semantic parts or concepts, and individual parts or concepts may appear in images from different classes. To address this issue, we develop a new method called cross-modal Concept Discovery and Reasoning (CoDaR). Exploiting the powerful text-image correlation capability of CLIP, our method automatically discovers a large set of distinctive visual concepts in images by matching them against a set of semantic text concepts. From these visual concepts, we construct a discriminative image representation and learn a concept reasoning network that performs downstream image classification tasks such as few-shot learning and domain generalization. Extensive experimental results demonstrate that CoDaR improves upon current state-of-the-art methods by large margins, for example, by up to 8.0% on few-shot learning and up to 1.3% on domain generalization.
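The abstract only sketches the pipeline, so below is a minimal illustrative sketch of the general idea it describes: scoring an image against a bank of text concepts with CLIP, then feeding the resulting concept-activation vector into a classifier head. It uses the public openai/CLIP package; the concept list, the `concept_activations` helper, and the linear `reasoning_head` are hypothetical stand-ins, not the paper's actual CoDaR concept-discovery procedure or reasoning network.

```python
# Sketch: concept-based image representation via CLIP similarities.
# Assumptions: pip-installed openai/CLIP; concept prompts, helper names,
# and the linear head are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical bank of semantic text concepts (parts/attributes, not class names).
concepts = ["a furry texture", "a pair of wings", "a metallic surface", "a wheel"]
with torch.no_grad():
    concept_emb = model.encode_text(clip.tokenize(concepts).to(device))
    concept_emb = concept_emb / concept_emb.norm(dim=-1, keepdim=True)

def concept_activations(image_path: str) -> torch.Tensor:
    """Score one image against every text concept via CLIP cosine similarity."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ concept_emb.T).squeeze(0)  # shape: (num_concepts,)

# Minimal stand-in for the "concept reasoning network": a linear head mapping
# concept activations to class logits, to be trained on the downstream task.
num_classes = 10  # placeholder
reasoning_head = nn.Linear(len(concepts), num_classes).to(device)

logits = reasoning_head(concept_activations("example.jpg").float())
```

The design point this illustrates is that classification operates on concept-level similarity scores rather than on a single whole-image/whole-description match, which is the mismatch the abstract argues against.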
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2752