Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Published: 18 Jun 2024, Last Modified: 05 Sept 2024, MFM-EAI@ICML2024 Poster, License: CC BY 4.0
Keywords: clip, multimodal, embeddings, information retrieval
TL;DR: We propose a novel multi-task contrastive training method for CLIP-like models and achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
Abstract: Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-size vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel multi-task contrastive training method to address this issue, which we use to train the JinaCLIP model and achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
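To make the multi-task contrastive objective concrete, the sketch below shows one way a shared text encoder could be trained jointly on text-text pairs and text-image pairs with InfoNCE losses. The encoder interfaces, loss weights, and temperature here are illustrative assumptions, not the paper's exact training recipe.

```python
# Minimal PyTorch sketch of a multi-task contrastive objective.
# Encoders, loss weights, and temperature are illustrative placeholders.
import torch
import torch.nn.functional as F


def info_nce(query: torch.Tensor, target: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of aligned (query, target) pairs."""
    query = F.normalize(query, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = query @ target.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(query.size(0), device=query.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


def multitask_loss(text_encoder, image_encoder,
                   queries, passages, captions, images,
                   w_text: float = 1.0, w_image: float = 1.0) -> torch.Tensor:
    """Combine a text-text term and a text-image term with a shared text encoder."""
    # Text-text retrieval term (e.g. query/passage pairs).
    loss_tt = info_nce(text_encoder(queries), text_encoder(passages))
    # Text-image alignment term (e.g. caption/image pairs).
    loss_ti = info_nce(text_encoder(captions), image_encoder(images))
    return w_text * loss_tt + w_image * loss_ti
```

Because the same text encoder appears in both terms, the resulting embeddings can serve text-only retrieval and multimodal retrieval from a single model, which is the inefficiency the abstract describes.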
Submission Number: 18