CSDNet: Contrastive Similarity Distillation Network for Multi-lingual Image-Text Retrieval

Published: 01 Jan 2023, Last Modified: 08 Apr 2025. ICIG (3) 2023. License: CC BY-SA 4.0
Abstract: Cross-modal image-text retrieval is a crucial task in the field of vision and language, aimed at retrieving relevant samples from one modality given a query expressed in another modality. While most methods developed for this task have focused on English, recent advances have expanded its scope to the multilingual domain. However, these methods face challenges due to the limited availability of annotated data in non-English languages. In this work, we propose a novel method that leverages an English pre-trained model as a teacher to improve multilingual image-text retrieval performance. Our method trains a student model that produces better multilingual image-text similarity scores by learning from the English image-text similarity scores of the trained teacher. We introduce a contrastive loss to align the two different representations of the image and text, and a Contrastive Similarity Distillation loss to align the multilingual image-text similarity distribution of the student with that of the English teacher. We evaluate our method on two popular datasets, i.e., MS-COCO and Flickr-30K, and achieve state-of-the-art performance. Our approach shows significant improvement over existing methods and has potential for practical applications.
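The abstract describes two objectives: a contrastive loss that aligns image and text representations, and a Contrastive Similarity Distillation loss that matches the student's multilingual similarity distribution to the teacher's English one. A minimal sketch follows, assuming an InfoNCE-style contrastive term and a KL divergence between softmax-normalized similarity rows for the distillation term; the exact formulation, temperatures, and weighting are defined in the paper, not here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, tau=0.07):
    """InfoNCE over cosine similarities; matched pairs sit on the diagonal.
    A standard formulation, assumed here as the paper's contrastive term."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T / tau                      # (batch, batch) similarity logits
    n = sim.shape[0]
    log_p = np.log(softmax(sim, axis=1))
    return -np.mean(log_p[np.arange(n), np.arange(n)])

def csd_loss(student_sim, teacher_sim, tau=0.07):
    """Contrastive Similarity Distillation sketch: KL(teacher || student)
    between row-wise similarity distributions (an assumed form)."""
    p = softmax(teacher_sim / tau, axis=1)       # teacher's English distribution
    q = softmax(student_sim / tau, axis=1)       # student's multilingual distribution
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1))
```

With this form, the distillation term vanishes when the student reproduces the teacher's similarity matrix exactly, which is the alignment the abstract describes.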
