YouCLIP: Advancing Multilingual Cross-Modal Learning with Efficient Training

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: CLIP; Vision-Language Pre-training; Non-English CLIP
TL;DR: The most powerful Chinese CLIP model was built using an efficient method
Abstract: Since the advent of vision-language pretraining, CLIP has become a foundational model for many downstream tasks. However, most advanced CLIP models available today are trained primarily on English, making them poorly suited for other languages and limiting accessibility in regions where those languages are dominant. Training CLIP models from scratch requires vast amounts of GPU resources and data, which are out of reach for most organizations outside companies on the scale of Google or OpenAI. This paper therefore proposes an efficient and straightforward three-stage fine-tuning method that converts the most powerful English CLIP models into models for other languages. The first stage aligns the embedding layer, the second performs token fusion, and the third applies contrastive-learning fine-tuning. To improve data quality, we also propose a translation filtering model to filter the training data. In this work, we target Chinese as the language of interest and name the resulting model YouCLIP, which is currently the most powerful Chinese CLIP model, significantly outperforming previous models across all Chinese benchmarks. For example, YouCLIP improves text-to-image Recall@1 on the COCO-CN dataset from 63.4 to 73.1. YouCLIP also retains strong English capabilities, achieving 76.9 Top-1 accuracy on ImageNet. Despite these results, YouCLIP requires the least training resources among Chinese CLIP models. All YouCLIP models and code will be open-sourced.
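The abstract describes the three training stages only at a high level. The sketch below is a minimal, hedged illustration of how stage-wise parameter freezing and the stage-3 contrastive objective could look in a PyTorch-style setup; the module names (text_encoder, token_embedding), the freezing choices, and the handling of token fusion are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a three-stage fine-tuning schedule as outlined in the
# abstract. Stage boundaries follow the abstract; all attribute names and
# freezing choices are hypothetical.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Standard CLIP-style symmetric InfoNCE loss (the stage-3 objective)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def set_trainable(model, stage):
    """Freeze/unfreeze parameters per stage (hypothetical module layout)."""
    for p in model.parameters():
        p.requires_grad = False
    if stage == 1:
        # Stage 1: align only the new-language token embedding layer.
        for p in model.text_encoder.token_embedding.parameters():
            p.requires_grad = True
    elif stage == 2:
        # Stage 2: token fusion; here approximated as training the text encoder.
        for p in model.text_encoder.parameters():
            p.requires_grad = True
    else:
        # Stage 3: contrastive fine-tuning of both towers.
        for p in model.parameters():
            p.requires_grad = True
```

In such a schedule, each stage would call set_trainable(model, stage) before its training loop, with stage 3 optimizing clip_contrastive_loss over paired image and (filtered) translated-text embeddings.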
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9636