Keywords: Semantic-preserving queries, Vision-language encoder models, Stability of retrieval, Joint embeddings
TL;DR: SemCLIP: Aligning vision-language encoder models
Abstract: Vision-language models (VLMs) bring image and textual representations close together in a joint embedding space to tackle many tasks, ranging from image captioning to text-to-image retrieval. For such models to be used reliably in cloud vector stores, it is important to have a stable association between images and text, such that synonymous queries bring up the same images or have a high degree of overlap. Current textual representations based on the transformer models used to build VLMs cannot adequately capture linguistic similarities to ensure such stability. In this paper, we develop a database of linguist-curated similarity lists of words derived from WordNet and train a semantics-preserving textual embedding. We then train an alignment transformation that maps existing VLM (CLIP) embeddings so as to bring synonymous embeddings closer while also preserving image-text similarities. The alignment transform is learned from textual embeddings alone, thereby avoiding large-scale retraining of VLMs from image-text pairs. This simple method outperforms other approaches to creating joint image-text embeddings, including fine-tuning the encoders with the same synonym lists. Analysis and comparisons on multiple benchmark datasets indicate both stable and improved retrieval quality. The dataset of similarity lists and the semantics-preserving textual embedding can be employed in a variety of ways for other downstream tasks and will be made available to other researchers.
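The abstract's core idea, learning an alignment transform over frozen text embeddings that pulls synonym pairs together while staying close to the original geometry, can be illustrated with a toy NumPy sketch. Everything here is an assumption for illustration: the random vectors stand in for frozen CLIP text embeddings, the synonym pairs are synthetic, and the objective (synonym-distance term plus an identity-regularization term) is only one plausible instantiation of the paper's training setup, not its actual loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension (real CLIP uses 512+)

# Stand-ins for frozen, unit-normalized text embeddings: rows of A and B
# are embeddings of synonymous query pairs from a curated similarity list.
A = rng.normal(size=(32, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)
B = A + 0.3 * rng.normal(size=(32, d))
B /= np.linalg.norm(B, axis=1, keepdims=True)

D = A - B            # per-pair embedding differences
n = len(D)
W = np.eye(d)        # alignment transform, initialized as identity
lam, lr = 0.1, 0.05  # regularization weight and step size (assumed values)

# Objective: mean ||W a_i - W b_i||^2  +  lam * ||W - I||_F^2
# The first term pulls synonyms together after alignment; the second keeps
# W near identity so original image-text similarities are roughly preserved.
for _ in range(300):
    grad = (2.0 / n) * W @ D.T @ D + 2.0 * lam * (W - np.eye(d))
    W -= lr * grad

orig_dist = np.mean(np.sum(D ** 2, axis=1))            # before alignment
new_dist = np.mean(np.sum((D @ W.T) ** 2, axis=1))     # after alignment
```

After training, `new_dist` falls below `orig_dist`: synonymous queries land closer in the aligned space, while the identity penalty keeps `W` from collapsing all embeddings to a point. The paper's actual transform is learned from textual embeddings alone, which is what makes this cheaper than retraining the full VLM on image-text pairs.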
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12626