Meta CLIP 2: A Worldwide Scaling Recipe

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 Spotlight · License: CC BY-NC 4.0
Keywords: CLIP, curation, multimodal, web, multilingual, data curation, pretraining, visual-language, contrastive learning, long-tail distribution, scaling
TL;DR: We generalize CLIP training to the worldwide web scale, scoring +0.8% over the English-only counterpart on zero-shot ImageNet classification (no compromise) and setting SoTA on zero-shot multilingual benchmarks: 57.4% on CVQA and 50.2% on Babel-ImageNet.
Abstract: Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks from zero-shot classification and retrieval to serving as the vision encoder for multimodal large language models (MLLMs). Although CLIP has been successfully trained on billion-scale image-text pairs from the English-speaking world, scaling CLIP's training further to learn from worldwide web data remains challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP models is worse than that of their English-only counterparts, i.e., the "curse of multilinguality" that is also common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges, and present a recipe enabling mutual benefits between English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and, surprisingly, sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks such as CVQA (57.4%), Babel-ImageNet (50.2%), and XM3600 (64.3% on image-to-text retrieval). Code and models are available at https://github.com/facebookresearch/MetaCLIP.
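The training objective referenced throughout is the standard CLIP symmetric contrastive loss. As a reference point, here is a minimal PyTorch sketch of that objective; the function name, argument names, and shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """Symmetric InfoNCE loss used in CLIP-style contrastive pretraining.

    image_embeds, text_embeds: (N, D) embeddings for N matched pairs.
    logit_scale: learned temperature scalar (exp of the log-temperature).
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # (N, N) cosine-similarity matrix; diagonal entries are the matched pairs.
    logits = logit_scale * image_embeds @ text_embeds.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img = F.cross_entropy(logits, targets)      # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_img + loss_txt) / 2
```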
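For readers who want to try the released models on the zero-shot classification task reported above, a call through HuggingFace transformers' CLIP classes looks roughly like the sketch below. The checkpoint identifier is a placeholder assumption; see the MetaCLIP repository for the actual Meta CLIP 2 weights.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint id (an assumption for illustration); see
# https://github.com/facebookresearch/MetaCLIP for released weights.
model_id = "facebook/metaclip-h14-fullcc2.5b"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # (1, num_labels) image-to-text similarity logits
    logits = model(**inputs).logits_per_image
probs = logits.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```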
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 14945