CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

ACL ARR 2025 May Submission 458 Authors

12 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining like CLIP, and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. First, we observe a strong inverse effect in learning with synthetic captions: short synthetic captions generally lead to much higher performance than full-length ones. We therefore feed only partial synthetic captions to the text encoder. Second, we incorporate an autoregressive captioner that mimics the recaptioning process: conditioned on the paired image and its web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot performance on cross-modal retrieval tasks, setting new state-of-the-art results on MSCOCO and Flickr30K. Moreover, the resulting vision encoders can enhance the visual capability of LLaVA, showing strong improvements on a range of MLLM benchmarks.
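
Below is a minimal sketch, based only on the abstract, of how the two designs might be combined in a single training step: a contrastive loss computed on a truncated synthetic caption, plus an autoregressive captioning loss conditioned on the image and the web-crawled caption. All module names, signatures, and tensor shapes are hypothetical assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIPSSketch(nn.Module):
    """Illustrative only: combines a CLIP-style contrastive loss on partial
    synthetic captions with a captioning loss on full synthetic captions."""

    def __init__(self, image_encoder, text_encoder, captioner):
        super().__init__()
        self.image_encoder = image_encoder   # assumed: images -> (B, D) features
        self.text_encoder = text_encoder     # assumed: token ids -> (B, D) features
        self.captioner = captioner           # assumed: autoregressive decoder over caption tokens
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ log(1 / 0.07)

    def contrastive_loss(self, img_emb, txt_emb):
        # Standard symmetric InfoNCE loss over image-text pairs in the batch.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        labels = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    def forward(self, images, synth_tokens, web_tokens, keep_len=32):
        # Design 1: feed only a partial (truncated) synthetic caption
        # to the text encoder for the contrastive objective.
        partial_tokens = synth_tokens[:, :keep_len]
        img_emb = self.image_encoder(images)
        txt_emb = self.text_encoder(partial_tokens)
        loss_clip = self.contrastive_loss(img_emb, txt_emb)

        # Design 2: the captioner, conditioned on the image features and the
        # web-crawled caption, predicts the full-length synthetic caption
        # (teacher forcing with a one-token shift).
        logits = self.captioner(img_emb, web_tokens, synth_tokens[:, :-1])
        loss_cap = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            synth_tokens[:, 1:].reshape(-1),
        )
        return loss_clip + loss_cap
```

The split of objectives (contrastive on short captions, generative on full captions) is what the abstract describes; how the captioner attends to image and web-caption inputs is left abstract here, since the paper's architecture details are not given in this excerpt.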
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal pretraining, image text matching, multimodality
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 458