SemCLIP: Aligning vision-language encoder models to semantic spaces for stability in retrieval

ICLR 2025 Conference Submission 12626 Authors

28 Sept 2024 (modified: 22 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Semantic-preserving queries, Vision-language encoder models, Stability of retrieval, Joint embeddings
TL;DR: SemCLIP: Aligning vision-language encoder models to semantic spaces for stability in retrieval
Abstract: Vision-language models (VLMs) bring image and textual representations close together in a joint embedding space to tackle tasks ranging from image captioning to text-to-image retrieval. For such models to be reliably used in cloud vector stores, the association between images and text must be stable, so that synonymous queries retrieve the same images or at least images with a high degree of overlap. The textual representations of current transformer-based VLMs cannot adequately capture linguistic similarities to ensure such stability. In this paper we build a database of linguist-curated word-similarity lists derived from WordNet and train a semantics-preserving textual embedding on it. We then train an alignment transformation that maps existing VLM (CLIP) embeddings so that synonymous embeddings move closer together while image-text similarities are preserved. The alignment transform is learned from textual embeddings alone, avoiding large-scale retraining of VLMs on image-text pairs. This simple method outperforms other approaches to creating image-joint text embeddings, including fine-tuning the encoders on the same synonym lists. Analysis and comparison on multiple benchmark datasets indicate both stable and improved retrieval quality. The dataset of similarity lists and the semantics-preserving textual embedding itself can be employed in a variety of ways for other downstream tasks and will be made available to other researchers.
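To make the abstract's alignment idea concrete, here is a minimal PyTorch sketch of one plausible reading: a linear transform over frozen CLIP text embeddings trained to pull synonym pairs together while a regularizer keeps pairwise similarities close to the original geometry. The loss form, embedding dimension, and hyperparameters are our assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the alignment transform described in the abstract:
# learn W over frozen CLIP text embeddings so that synonym embeddings move
# closer together while the original similarity structure is preserved.
# dim and lambda_preserve are illustrative assumptions.

dim = 512                            # assumed CLIP text-embedding dimension
W = torch.nn.Linear(dim, dim, bias=False)
optimizer = torch.optim.Adam(W.parameters(), lr=1e-4)
lambda_preserve = 1.0                # weight on the geometry-preservation term

def alignment_loss(anchor, synonym, batch):
    """anchor/synonym: (B, dim) frozen CLIP embeddings of synonym pairs;
    batch: (B, dim) embeddings of arbitrary texts used as a geometry probe."""
    a = F.normalize(W(anchor), dim=-1)
    s = F.normalize(W(synonym), dim=-1)
    # Pull synonym pairs together in the aligned space (cosine loss).
    pull = (1.0 - (a * s).sum(dim=-1)).mean()
    # Keep pairwise similarities among arbitrary texts close to the original
    # CLIP similarities, so image-text alignment is not destroyed.
    b_orig = F.normalize(batch, dim=-1)
    b_new = F.normalize(W(batch), dim=-1)
    preserve = F.mse_loss(b_new @ b_new.T, b_orig @ b_orig.T)
    return pull + lambda_preserve * preserve

# One training step on precomputed (frozen) text embeddings:
# loss = alignment_loss(anchor, synonym, batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Because the transform is trained on text embeddings alone, no image encoder forward passes are needed; at retrieval time one would presumably apply the same transform on both sides of the joint space so image-text similarities remain consistent (again, an assumption on our part).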
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12626